Abstract:
The Convolutional Neural Network (CNN) is an important machine learning algorithm. Owing to its broad range of applications and its classification accuracy, it has become a hot topic in recent years. CNNs are computationally expensive and require extensive memory accesses, which makes them inefficient on general-purpose computers. GPU implementations have improved the performance of the algorithm, but the high energy consumption of GPUs rules out their use in robotics and mobile embedded platforms. This study presents the implementation details of mapping CNNs onto field-programmable gate arrays (FPGAs). Visual Geometry Group (VGG-16) networks are among the most popular CNN architectures in the community. Their uniform and regular structure makes them well suited to FPGA implementation, so a detailed discussion of mapping VGG-16 style networks onto an FPGA is presented. The Flower Recognition example from Kaggle was used as a case study. A VGG-style network was trained on a Core i9 computer with an NVIDIA GTX 1660 GPU. The trained network achieved an accuracy of 90% on this dataset. Trained CNNs are algorithmically simple to model and deploy. A Xilinx Zynq ZedBoard was used for analytical modeling and mapping of the CNN. The trained CNN was partitioned into two parts, a hardware part and a software part. The hardware part comprises the computationally intensive convolutions, and the software part comprises the computationally less expensive tasks such as the pooling, fully connected, and softmax layers. The hardware part of the CNN was mapped onto the Zynq programmable logic (Zynq-PL) and the software part onto the Zynq processing system (Zynq-PS). Of the different types of parallelism available in a CNN workload, the proposed methodology exploits inter-output parallelism in the design of the hardware accelerator on the Zynq-PL. The hardware design on the Zynq-PL also takes the memory access patterns of the convolution operation into consideration and optimizes them to achieve good performance.
For the complete network implementation, the proposed methodology achieved a peak performance of 1.3 GMACCs at a clock frequency of 120 MHz and a speedup of 4 times over a software implementation on a general-purpose computer.
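
To illustrate what inter-output parallelism means for a convolution layer, the following is a minimal software sketch and not the accelerator design itself. It assumes stride 1, no padding, and plain nested lists; the function name `conv2d_inter_output` and the parameter `n_units` (a hypothetical number of parallel compute units) are illustrative assumptions. Output feature maps are processed in tiles of `n_units`, and each fetched input pixel is shared by every output map in the tile, which is the loop a hardware design would unroll into parallel multiply-accumulate units.

```python
def conv2d_inter_output(inputs, weights, n_units=4):
    """Direct convolution (cross-correlation form, stride 1, no padding).

    inputs:  nested list indexed [c_in][row][col]
    weights: nested list indexed [c_out][c_in][k_row][k_col]
    n_units: hypothetical number of parallel compute units (assumption)
    """
    c_in = len(inputs)
    height, width = len(inputs[0]), len(inputs[0][0])
    c_out = len(weights)
    k = len(weights[0][0])
    out_h, out_w = height - k + 1, width - k + 1
    outputs = [[[0.0] * out_w for _ in range(out_h)] for _ in range(c_out)]

    # Output feature maps are handled in tiles of n_units.
    for base in range(0, c_out, n_units):
        tile = range(base, min(base + n_units, c_out))
        for y in range(out_h):
            for x in range(out_w):
                for c in range(c_in):
                    for ky in range(k):
                        for kx in range(k):
                            # One input pixel is fetched once ...
                            pixel = inputs[c][y + ky][x + kx]
                            # ... and reused by every output map in the tile.
                            # In hardware this innermost loop would be
                            # unrolled, one multiply-accumulate per unit.
                            for m in tile:
                                outputs[m][y][x] += pixel * weights[m][c][ky][kx]
    return outputs


if __name__ == "__main__":
    image = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]   # 1 input channel, 3x3
    kernels = [
        [[[1, 1], [1, 1]]],                       # output map 0: 2x2 sum
        [[[1, 0], [0, -1]]],                      # output map 1: diagonal difference
    ]
    print(conv2d_inter_output(image, kernels, n_units=2))
```

The tile size changes only how the work is scheduled, not the result, so the same function can be run with `n_units=1` as a sequential reference when checking a parallel schedule.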