Abstract:
Video has become more popular in many applications in recent years due to increased storage
capacity, more advanced network architectures, and easy access to digital cameras,
especially in mobile phones. Classifying the type of motion in a video sequence is an area
targeted by many researchers for purposes such as traffic control, video scene classification, event
prediction, sports analysis, and management of web videos. There are several conventional and
unconventional techniques for motion classification in videos, but with the advent of
sophisticated algorithms and high computational capabilities, deep learning architectures are
now used for almost every image/video processing task, including motion classification. Deep
learning methods are more reliable and effective than other approaches. Training a deep
learning architecture for motion classification requires that all of the frames (pixel by pixel)
be fed to the network along with their corresponding labels; once the network has learned the
classification task, it can be used for inference. However, this method requires a lot of
memory and computational resources because a large amount of data (all the frames in a video) needs
to be processed by the architecture. We aim to reduce the amount of data to be processed by
the deep learning architecture for the motion classification task, which in turn lowers
memory requirements and computational complexity. At the same time, we strive to
maintain the classification accuracy. A video is a sequence of individual frames; hence there
exists a lot of temporal redundancy between consecutive frames. This redundancy can be
exploited by traditional motion estimation, which captures the motion
information in a video sequence. If, instead of inputting the standard video frames to the deep
learning architectures, we feed them the motion information, our architectures have to
process much less information for the motion classification task. In our work, the
motion information in a video sequence is retrieved using the three-step search, which is a
block matching algorithm. This algorithm gives us the motion vectors, which contain the motion
information in a video sequence; hence we train our network on these motion vectors
instead of the standard frames to perform the motion classification task. Experimental results show
that with our proposed method, the motion classification task can be carried out by
processing much less information while maintaining good accuracy.
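
To make the block matching step concrete, the following is a minimal sketch of a three-step search in Python with NumPy. It is an illustration under stated assumptions, not the implementation used in this work: the function name three_step_search, the 16x16 block size, the initial step of 4 pixels, and the sum of absolute differences (SAD) as the matching cost are all choices made here for the example and are not specified in the abstract.

```python
import numpy as np

# Centre first, then the 8 surrounding positions; the centre wins ties.
OFFSETS = [(0, 0)] + [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                      if (dy, dx) != (0, 0)]


def three_step_search(ref, cur, block=16, step=4):
    """Return one (dy, dx) motion vector per block of `cur`.

    A vector (dy, dx) means the block at (y0, x0) in `cur` best matches
    the block at (y0 + dy, x0 + dx) in `ref`. Block size, initial step
    and the SAD cost are assumptions made for this sketch.
    """
    h, w = cur.shape
    ref = ref.astype(np.int32)  # avoid uint8 wrap-around in differences
    cur = cur.astype(np.int32)
    rows, cols = h // block, w // block
    vectors = np.zeros((rows, cols, 2), dtype=np.int32)

    for r in range(rows):
        for c in range(cols):
            y0, x0 = r * block, c * block
            target = cur[y0:y0 + block, x0:x0 + block]
            best = (0, 0)
            s = step
            while s >= 1:  # step sizes 4, 2, 1: the three steps
                cy, cx = best
                best_cost = None
                for oy, ox in OFFSETS:
                    dy, dx = cy + oy * s, cx + ox * s
                    y, x = y0 + dy, x0 + dx
                    # Skip candidates that fall outside the reference frame.
                    if y < 0 or x < 0 or y + block > h or x + block > w:
                        continue
                    cost = np.abs(target - ref[y:y + block, x:x + block]).sum()
                    if best_cost is None or cost < best_cost:
                        best_cost, best = cost, (dy, dx)
                s //= 2
            vectors[r, c] = best
    return vectors
```

With these assumed settings, a 64x64 grayscale frame (4096 pixel values) is summarized by a 4x4 grid of two-component vectors (32 values), which illustrates the kind of input reduction described above. The three-step search examines only a few candidate positions per step, so it can settle on a local rather than global minimum of the matching cost.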