Abstract:
Convolutional Neural Networks (CNNs) have achieved great success over the past decade
in solving challenging classification problems. To attain high accuracy, CNN models
must perform a very large number of computations, which demand substantial memory
storage and computation time. Graphics Processing Units (GPUs) are therefore commonly
used to deploy CNNs, improving overall performance and hiding much of this
computational cost. However, GPU-based implementations require a capable processor and
draw considerable power, and the high resource and energy demands of these models make
them hard to map onto reconfigurable architectures and unsuitable for resource-restricted
edge devices. An efficient strategy is therefore needed for deploying CNNs effectively
on edge devices and reconfigurable architectures. CNNs exhibit error-tolerant behavior
and can still predict correctly from approximate values.
Exploiting this property, we present a memory-efficient and resource-aware technique
for the compression and optimization of convolutional neural networks. Our main goal is
to reduce the computational complexity and memory consumption of a CNN architecture
while preserving the model's overall accuracy. To accomplish this objective, we propose
a collaborative network compression strategy in which pruning-based compression (PBC)
is first applied to lower the computational complexity of the model. PBC prunes the
weight parameters of the layers in the network to achieve maximum compression and
produces the compressed model.
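As a rough illustration of this pruning step, the sketch below applies magnitude-based weight pruning to a single layer with NumPy; the quantile threshold and the prune_rate value are illustrative assumptions, not the exact PBC criterion.

```python
import numpy as np

def prune_layer(weights, prune_rate=0.9):
    """Zero out the smallest-magnitude weights of one layer.

    prune_rate is the fraction of weights removed; the actual PBC
    criterion may differ (this threshold rule is an assumption).
    """
    threshold = np.quantile(np.abs(weights), prune_rate)
    mask = np.abs(weights) > threshold
    return weights * mask, mask

# Example: prune a hypothetical bank of 16 convolution kernels of shape 3x5x5.
rng = np.random.default_rng(0)
conv_weights = rng.normal(size=(16, 3, 5, 5)).astype(np.float32)
pruned, mask = prune_layer(conv_weights, prune_rate=0.9)
print("non-zero weights kept:", int(mask.sum()), "of", mask.size)
```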
In the next step, the pruned model is divided into two sub-networks, i.e., a uniform
(non-sparse) network and a sparse network, and both are further optimized and compressed
with the proposed optimization techniques. Because of the uneven weight distribution in
the sparse network, optimization of the network by incremental quantization (ONIQ) is
used to quantize its layers.
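A minimal sketch of what such incremental quantization can look like is given below, assuming ONIQ quantizes the surviving sparse weights to powers of two in growing fractions, largest magnitudes first; the step schedule, exponent range, and omission of retraining are all assumptions for illustration.

```python
import numpy as np

def nearest_power_of_two(w, min_exp=-6, max_exp=-1):
    """Map one weight to the nearest signed power of two (zeros stay zero)."""
    if w == 0.0:
        return 0.0
    exp = np.clip(np.round(np.log2(abs(w))), min_exp, max_exp)
    return float(np.sign(w) * 2.0 ** exp)

def incremental_quantize(weights, steps=(0.5, 0.75, 1.0)):
    """Freeze growing fractions of the largest-magnitude weights to powers of two."""
    q = weights.copy()
    order = np.argsort(-np.abs(q).ravel())            # largest magnitude first
    n_nonzero = int((q != 0).sum())
    for fraction in steps:
        for flat_idx in order[: int(fraction * n_nonzero)]:
            i = np.unravel_index(flat_idx, q.shape)
            if q[i] != 0.0:
                q[i] = nearest_power_of_two(q[i])
        # In a full pipeline, the still-unquantized weights would be
        # retrained here before the next increment.
    return q
```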
Similarly, optimization of the network by optimized quantization (ONOQ) is proposed and
applied to the layers of the uniform network, quantizing the weight values to the optimal
levels determined by an optimizer. The optimizer extracts the best levels for quantizing
the weight parameters, and these optimal levels are used to obtain the best possible
trade-off between compression ratio and model accuracy.
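The abstract does not specify the optimizer, so the sketch below stands in with a simple k-means-style search for a small set of shared levels and then snaps every weight of a dense layer to its nearest level; the number of levels and the initialization are assumptions.

```python
import numpy as np

def optimize_levels(weights, n_levels=8, iters=50):
    """Search for n_levels shared quantization values with a k-means-style loop
    (an assumed stand-in for the paper's level optimizer)."""
    w = weights.ravel()
    levels = np.quantile(w, np.linspace(0.0, 1.0, n_levels))   # initial guess
    for _ in range(iters):
        assign = np.argmin(np.abs(w[:, None] - levels[None, :]), axis=1)
        for k in range(n_levels):
            if np.any(assign == k):
                levels[k] = w[assign == k].mean()
    return levels

def quantize_to_levels(weights, levels):
    """Snap every weight to its closest optimized level."""
    flat = weights.ravel()
    assign = np.argmin(np.abs(flat[:, None] - levels[None, :]), axis=1)
    return levels[assign].reshape(weights.shape)

# Example on a hypothetical dense (uniform) layer; sweeping n_levels and
# re-measuring accuracy would expose the compression/accuracy trade-off.
rng = np.random.default_rng(0)
fc_weights = rng.normal(scale=0.05, size=(256, 128)).astype(np.float32)
levels = optimize_levels(fc_weights)
quantized = quantize_to_levels(fc_weights, levels)
print("distinct weight values after quantization:", np.unique(quantized).size)
```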
We apply the proposed strategy to LeNet-5 trained on the MNIST dataset, Cifar-Quick
trained on the CIFAR-10 dataset, and the VGG-16 network trained on the ImageNet
ILSVRC2012 dataset. The proposed approach outperforms state-of-the-art techniques,
achieving a high compression ratio with only a very slight drop in accuracy.