dc.description.abstract |
Interaction recognition is a sub-domain of human activity recognition that focuses primarily on recognizing actions occurring between two subjects, which may be human-human or human-object interactions. In this area, researchers have concentrated on tasks such as object and person detection, tracking, and recognition of actions performed between subjects in videos. However, recognizing such activities in videos poses challenges, resulting in a limited number of available methods. While significant improvements have been made in recognizing solo actions, research on recognizing complex activities involving multiple subjects is ongoing. In this study, we propose a novel keypoints-based deep learning model called 'InterAcT' that focuses on recognizing solo actions and interactions between two individuals in grayscale aerial videos. InterAcT is inspired by the Action Transformer (AcT) model and captures spatial and temporal information using pose data. It features a lightweight architecture with 0.0795 million parameters and 0.0389 GFLOPs, distinguishing it from the AcT models. The pipeline consists primarily of a preprocessing stage and a pose-based deep learning transformer model. The preprocessing stage includes data augmentation, person detection, keypoint extraction, and data transformation modules. The transformer stage comprises six components: Linear Projection of Features, Class Token Embedding, Position Embedding, Transformer Encoder Layers, MLP Head, and Predicted Class Label. The transformer model uses sequential 2D pose data for training and outputs the recognized class. For performance evaluation, we used two public datasets: the Drone Action dataset and the UT-Interaction dataset, totaling 18 classes (13 solo action and 5 interaction classes). The model was trained on 80% of the data, validated on 10%, and tested on the remaining 10%, achieving an
accuracy of 99.23%. On the same preprocessed data, we compared our model with benchmark models. It outperformed the AcT variants (micro: 93.53%, small: 98.93%, base: 99.07%, large: 95.58%), as well as 2P-GCN (93.37%), LSTM (97.74%), 3D-ResNet (99.21%), and 3D CNN (99.20%). Our novel framework can recognize a large number of solo action and two-person interaction classes in aerial videos, as well as in fixed-camera videos (grayscale and RGB). Owing to its lightweight architecture, it can be used in real-world applications such as security and surveillance. |
en_US |