Abstract:
The combination of spatiotemporal video information and essential features can improve the performance of human action recognition (HAR); however, any individual type of feature usually degrades performance in the presence of similar actions and complex backgrounds. In recent years, deep convolutional neural networks have improved performance for several computer vision applications, such as video surveillance, owing to their ability to capture spatial information. This dissertation proposes three techniques for human action recognition using deep learning. The first technique, named HybridHR-Net, targets action recognition in video surveillance. Deep transfer learning is employed to adapt the pre-trained EfficientNet-b0 model, and Bayesian optimization is used to tune the hyperparameters of the fine-tuned model. Instead of fully connected layer features, the activations of the average pooling layer are used for feature extraction. Two feature selection techniques, an improved artificial bee colony algorithm and an entropy-based approach, are employed to select the best features. Using a serial fusion technique, the selected features are combined into a single vector, which is then classified by machine learning classifiers.
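A minimal sketch of this feature extraction and serial fusion stage is given below, assuming frames are already resized to 224x224 RGB; the Bayesian hyperparameter search and the improved artificial bee colony and entropy-based selectors are specific to the proposed method and are only stubbed here with placeholder index sets.

```python
# Sketch: average-pooling features from pre-trained EfficientNet-b0,
# serial fusion of two selected subsets, and a classical classifier.
import numpy as np
import tensorflow as tf
from sklearn.svm import SVC

# Average-pooling activations instead of fully connected layer features.
backbone = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet", pooling="avg")

def extract_features(frames):
    """frames: (N, 224, 224, 3) uint8 array -> (N, 1280) feature matrix."""
    x = tf.keras.applications.efficientnet.preprocess_input(frames.astype("float32"))
    return backbone.predict(x, verbose=0)

def serial_fusion(feats, abc_idx, entropy_idx):
    """Concatenate the two selected feature subsets into one vector per sample."""
    return np.concatenate([feats[:, abc_idx], feats[:, entropy_idx]], axis=1)

# Example usage with placeholder selections (stand-ins for the improved ABC
# and entropy-based selectors described above).
frames = np.random.randint(0, 255, (8, 224, 224, 3), dtype=np.uint8)
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])
feats = extract_features(frames)
abc_idx = np.arange(0, 512)          # hypothetical ABC-selected indices
entropy_idx = np.arange(512, 1024)   # hypothetical entropy-selected indices
fused = serial_fusion(feats, abc_idx, entropy_idx)
clf = SVC(kernel="rbf").fit(fused, labels)
```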
In the second technique, a new framework is proposed for accurate HAR based on deep learning and an improved feature optimization algorithm. The proposed framework comprises several important steps, from deep feature extraction to classification. Before the fine-tuned deep learning models, MobileNet-V2 and Darknet53, are trained, the original video frames are normalized. The pre-trained deep models are then used for feature extraction, and the extracted features are fused using the canonical correlation approach. Next, an improved particle swarm optimization (IPSO)-based algorithm is used to select the best features. Finally, the selected features are used to classify actions with various classifiers.
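The fusion step can be illustrated as follows; this is a hedged sketch in which random matrices stand in for the MobileNet-V2 and Darknet53 feature streams, and a simple variance-based mask replaces the IPSO selector purely for illustration.

```python
# Sketch: canonical correlation fusion of two deep feature streams,
# followed by a placeholder feature selection step and a classifier.
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.neighbors import KNeighborsClassifier

def cca_fusion(feats_a, feats_b, n_components=64):
    """Project both feature matrices into a shared canonical space and
    concatenate the projections into a single fused descriptor."""
    cca = CCA(n_components=n_components)
    proj_a, proj_b = cca.fit_transform(feats_a, feats_b)
    return np.concatenate([proj_a, proj_b], axis=1)

# Placeholder matrices standing in for MobileNet-V2 / Darknet53 features.
rng = np.random.default_rng(0)
feats_mobilenet = rng.normal(size=(100, 1280))
feats_darknet = rng.normal(size=(100, 1024))
labels = rng.integers(0, 6, size=100)

fused = cca_fusion(feats_mobilenet, feats_darknet)
# Stand-in for the IPSO step: keep the highest-variance fused dimensions.
keep = np.argsort(fused.var(axis=0))[-64:]
clf = KNeighborsClassifier(n_neighbors=5).fit(fused[:, keep], labels)
```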
In the third technique, human detection is performed using correlation filtering and traditional features. Humans are recognized in static images through omega shapes (the head-shoulder contour). For this purpose, correlation filters are combined with pre-processing algorithms to recognize a human in video imagery. Background extraction is performed to remove extra details that would otherwise complicate recognition. Moreover, optimized correlation values are computed through a Hierarchical Particle Swarm Optimization (HPSO) algorithm for the final classification.
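An illustrative sketch of correlation filtering after background extraction is shown below, assuming an omega-shape (head-shoulder) template is available as a small grayscale image; the HPSO optimization of the correlation values is not reproduced here.

```python
# Sketch: background subtraction followed by frequency-domain correlation
# of the foreground with an omega-shape template.
import cv2
import numpy as np

bg_subtractor = cv2.createBackgroundSubtractorMOG2(history=200, detectShadows=False)

def correlation_response(frame_gray, template):
    """Circular cross-correlation of the foreground with the template via the FFT."""
    fg_mask = bg_subtractor.apply(frame_gray)
    foreground = cv2.bitwise_and(frame_gray, frame_gray, mask=fg_mask)
    # Zero-pad the template to the frame size before correlating.
    padded = np.zeros_like(foreground, dtype=np.float32)
    padded[: template.shape[0], : template.shape[1]] = template
    response = np.real(np.fft.ifft2(np.fft.fft2(foreground.astype(np.float32)) *
                                    np.conj(np.fft.fft2(padded))))
    return response

def detect_peak(response):
    """Return the location and value of the strongest correlation peak."""
    idx = np.unravel_index(int(response.argmax()), response.shape)
    return idx, float(response[idx])
```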
Five publicly accessible datasets are utilized in the experimental evaluation of the first methodology, achieving notable accuracies of 97%, 98.7%, 100%, 99.7%, and 96.8%, respectively. For the second methodology, six datasets, namely KTH, UT-Interaction, UCF Sports, Hollywood, IXMAS, and UCF YouTube, are used, attaining accuracies of 98.3%, 98.9%, 99.8%, 99.6%, 98.6%, and 100%, respectively. The third method is evaluated on the KTH dataset and achieves improved detection accuracy. Additionally, the proposed frameworks are compared with contemporary methods to demonstrate the increase in accuracy.