Abstract:
The Convolutional Neural Network (CNN) is an important machine learning algorithm. Owing to its
broad applications and high classification accuracy, it has become a hot topic in recent years.
CNNs are computationally expensive and memory-access intensive, which makes them inefficient
on general-purpose computers. GPU implementations have improved the performance of the
algorithm, but the high energy consumption of GPUs prevents their use in robotics and mobile
embedded platforms. This study presents the implementation details of mapping Convolutional
Neural Networks onto field-programmable gate arrays (FPGAs).
Visual Geometry Group (VGG-16) networks are among the most popular CNN architectures in the
community. Their uniform and regular structure makes them well suited to FPGA implementation,
so a detailed discussion of mapping VGG-16 style networks onto an FPGA is presented.
The Flower Recognition dataset from Kaggle was used as a case study. A VGG-style network was
trained on an Intel Core i9 computer with an NVIDIA GTX 1660 GPU. On this dataset, the trained
network achieved an accuracy of 90%. Trained CNNs are algorithmically simple to model and deploy.
A Xilinx Zynq ZedBoard was used for analytical modeling and mapping of the CNN. The trained CNN
was partitioned into a hardware part and a software part. The hardware part comprises the
computationally intensive convolutions, while the software part comprises computationally less
expensive tasks such as the pooling, fully connected, and softmax layers. The hardware part of
the CNN was mapped onto the Zynq programmable logic (Zynq-PL) and the software part onto the
Zynq processing system (Zynq-PS). Of the different types
of parallelism available in a CNN workload, the proposed methodology exploited inter-output
parallelism in the design of the hardware accelerator on the Zynq-PL. The hardware design
also took the memory access patterns of the convolution operation into account and optimized
them for performance. For the complete network implementation, the proposed methodology
achieved a peak performance of 1.3 GMACCs at a clock frequency of 120 MHz and a 4x speedup
over a software implementation on a general-purpose computer.
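
The inter-output parallelism named in the abstract can be illustrated with a minimal software sketch. It is not the accelerator design itself: the function name `conv2d_inter_output`, the parameter `num_pe` (a hypothetical number of parallel processing elements), and the choice of stride 1 with no padding are all assumptions made for illustration. The idea shown is that each input pixel is fetched once and shared by several output channels, whose multiply-accumulates could then run concurrently in hardware.

```python
def conv2d_inter_output(ifmap, weights, num_pe=4):
    """Direct convolution (stride 1, no padding), tiled over output channels.

    ifmap:   nested list indexed [c_in][row][col]
    weights: nested list indexed [c_out][c_in][k_row][k_col]
    num_pe:  hypothetical number of processing elements (an assumption)
    Returns (ofmap, macs), where macs counts multiply-accumulate operations.
    """
    c_in = len(ifmap)
    h, w = len(ifmap[0]), len(ifmap[0][0])
    c_out = len(weights)
    k = len(weights[0][0])
    oh, ow = h - k + 1, w - k + 1
    ofmap = [[[0.0] * ow for _ in range(oh)] for _ in range(c_out)]
    macs = 0

    # Output channels are handled in tiles of num_pe channels.
    for co_base in range(0, c_out, num_pe):
        co_end = min(co_base + num_pe, c_out)
        for y in range(oh):
            for x in range(ow):
                for ci in range(c_in):
                    for ky in range(k):
                        for kx in range(k):
                            # One input pixel is read once and shared
                            # by every output channel in the tile.
                            pixel = ifmap[ci][y + ky][x + kx]
                            # In hardware this innermost loop would be
                            # unrolled: one MAC unit per output channel.
                            for co in range(co_base, co_end):
                                ofmap[co][y][x] += pixel * weights[co][ci][ky][kx]
                                macs += 1
    return ofmap, macs


# Small example: one 3x3 input channel, two 2x2 filters.
ifmap = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
weights = [[[[1, 1], [1, 1]]], [[[1, 0], [0, -1]]]]
ofmap, macs = conv2d_inter_output(ifmap, weights, num_pe=2)
print(ofmap, macs)
```

In this sketch the result does not depend on `num_pe`; only the grouping of work changes, which is what makes the output-channel dimension a natural candidate for parallel hardware units.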