Abstract:
CNNs have proven effective as deep learning methods for Human Action Recognition
(HAR), along with other computer vision tasks, but overfitting remains a problem in this domain,
as deep learning models need large amounts of data for training. This thesis is
inspired by the two-stream network for HAR, in which a CNN is deployed as the base model to
show that both the spatial and the temporal aspects of an action are important for its recognition.
To address this issue, we propose an enhancement of the spatial stream
consisting of two parts. First, we adopt transfer learning in the spatial stream,
demonstrating that using models pre-trained on larger datasets such as ImageNet yields
better performance than training the original model from scratch. Second, we apply a data
augmentation technique, increasing the dataset size by performing random
transformations such as rotation, cropping, and flipping on the images. Finally, fine-tuning the
enhanced spatial stream network on the augmented dataset further increases accuracy.
Our architecture is trained and tested on the UCF-101 dataset, a standard
benchmark for action videos. Our results are competitive with those of the state-of-the-art
two-stream network, and our network performs favorably on the spatial stream
compared to other models.