Abstract:
Model compression is an essential technique for reducing redundancy in deep
neural networks (DNNs), enabling their efficient deployment on a variety of
hardware ranging from GPU clusters in data centers to highly resource-constrained processors in edge devices. Conventionally, compressing a DNN
requires crafting policies that balance the size, speed, and accuracy of the
network for a particular hardware target. A policy is selected only after
multiple trials and analyses of the design space, making the selection process
laborious, time-consuming, and dependent on human expertise. We propose
Auto Compress, which primarily uses Bayesian optimization to automatically
determine a compression policy based on a combination of pruning and
quantization. The strategy for combining structured pruning, unstructured
pruning, and quantization is fixed at the outset, based on the characteristics
of the target hardware. The automatically learned compression policy
outperforms hand-crafted policies, providing greater compression while
preserving accuracy. Applying our single-click compression to Plain20 on
CIFAR-10, we achieve a 66.2% FLOPs reduction with a Top-1 accuracy of
88%, while our size-focused compression of ResNet-20 achieves an 11.2x
reduction in size without any loss of accuracy.
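As an illustrative sketch only (not the paper's implementation), the following shows how a Bayesian-optimization search over pruning and quantization hyperparameters might be set up with scikit-optimize's gp_minimize. The search ranges, the accuracy/FLOPs trade-off weight, and the compress_and_evaluate helper are hypothetical placeholders introduced here for clarity.

# Illustrative sketch: Bayesian optimization over a pruning/quantization
# search space using scikit-optimize (skopt). The ranges, scoring weights,
# and compress_and_evaluate() are hypothetical, not the paper's actual code.
from skopt import gp_minimize
from skopt.space import Real, Integer

# Hypothetical compression hyperparameters: pruning ratio and bit-width.
search_space = [
    Real(0.1, 0.9, name="prune_ratio"),   # fraction of weights to prune
    Integer(2, 8, name="weight_bits"),    # quantization bit-width
]

def compress_and_evaluate(prune_ratio, weight_bits):
    """Placeholder: prune and quantize the model with the given settings,
    fine-tune briefly, and return (top1_accuracy, flops_reduction)."""
    # A real pipeline would call into the training/compression code here.
    return 0.88, 0.66  # dummy values so the sketch runs end to end

def objective(params):
    prune_ratio, weight_bits = params
    top1, flops_reduction = compress_and_evaluate(prune_ratio, weight_bits)
    # Negated because gp_minimize minimizes; the 0.5 weight is arbitrary.
    return -(top1 + 0.5 * flops_reduction)

result = gp_minimize(objective, search_space, n_calls=30, random_state=0)
print("best hyperparameters:", result.x, "best score:", -result.fun)

In this sketch each gp_minimize trial compresses and evaluates one candidate policy, and the Gaussian-process surrogate proposes the next candidate, replacing the manual trial-and-error search described above.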