Abstract:
Substantial research is being conducted on spatial and spatiotemporal saliency modeling for images and videos, respectively, with the aims of eye fixation prediction and salient object detection, as saliency is successfully used in a number of contemporary applications such as video segmentation, compression, summarization, and robotics. It is an inherently effortless task for humans to perceive and interpret their surroundings using their major senses, vision and hearing. The sensory system lets humans focus on the interesting parts of the scene around them, namely the salient regions, while disregarding non-valuable information without any noticeable effort on their part. This task, seemingly so easy for humans, is not so for a computer, which is why the field of computer vision strives extensively to endow machines with such senses. Spatiotemporal saliency modeling is generally challenging because it requires feature extraction and selection at the pixel or region level, and all the more so because of varying video conditions such as background clutter, camera motion, object occlusion and interaction, and object deformation. This work aims to provide an audio-visual spatiotemporal saliency model for computing saliency maps of complex scenes. The proposed solution is to be evaluated on a publicly available video dataset against eye fixation data and compared with state-of-the-art spatiotemporal saliency models.