Abstract:
Model compression is an essential technique for reducing redundancy in deep
neural networks (DNNs), enabling their efficient deployment on a variety of
hardware ranging from GPU clusters in data centers to highly resource-constrained processors in edge devices. Conventionally, compressing a DNN
requires crafting policies that balance the size, speed, and accuracy of the
network for a particular hardware target. A policy is selected only after
multiple trials and analyses of the design space, making the selection process
laborious, time-consuming, and dependent on human expertise. We propose
Auto Compress, which primarily uses Bayesian optimization to automatically
determine a compression policy based on a combination of pruning and
quantization. The strategy for combining structured pruning, unstructured
pruning, and quantization is fixed at the outset, based on the characteristics
of the target hardware. The automatically learned compression policy
outperforms hand-crafted policies, providing greater compression while
preserving accuracy. Applying our single-click compression to Plain20 on
CIFAR-10, we achieve a 66.2% FLOPs reduction with a Top-1 accuracy of
88%, while our size-focused compression of ResNet-20 achieves an 11.2x
reduction in size without any loss of accuracy.
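As an illustrative sketch only (not the paper's implementation), the following shows how a Bayesian-optimization search over pruning and quantization hyperparameters might be set up with scikit-optimize's gp_minimize. The search ranges, the accuracy/FLOPs trade-off weight, and the compress_and_evaluate helper are hypothetical placeholders introduced here for clarity.

# Illustrative sketch: Bayesian optimization over a pruning/quantization
# search space using scikit-optimize (skopt). The ranges, scoring weights,
# and compress_and_evaluate() are hypothetical, not the paper's actual code.
from skopt import gp_minimize
from skopt.space import Real, Integer

# Hypothetical compression hyperparameters: pruning ratio and bit-width.
search_space = [
    Real(0.1, 0.9, name="prune_ratio"),   # fraction of weights to prune
    Integer(2, 8, name="weight_bits"),    # quantization bit-width
]

def compress_and_evaluate(prune_ratio, weight_bits):
    """Placeholder: prune and quantize the model with the given settings,
    fine-tune briefly, and return (top1_accuracy, flops_reduction)."""
    # A real pipeline would call into the training/compression code here.
    return 0.88, 0.66  # dummy values so the sketch runs end to end

def objective(params):
    prune_ratio, weight_bits = params
    top1, flops_reduction = compress_and_evaluate(prune_ratio, weight_bits)
    # Negated because gp_minimize minimizes; the 0.5 weight is arbitrary.
    return -(top1 + 0.5 * flops_reduction)

result = gp_minimize(objective, search_space, n_calls=30, random_state=0)
print("best hyperparameters:", result.x, "best score:", -result.fun)

In this sketch each gp_minimize trial compresses and evaluates one candidate policy, and the Gaussian-process surrogate proposes the next candidate, replacing the manual trial-and-error search described above.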