Abstract:
In computer vision, object detection and classification are active fields of research.
Their applications span a diverse range of fields, including surveillance, autonomous cars,
robotic vision, search and rescue, driver-assistance systems, and military applications.
Researchers have built many intelligent systems that aim to match the accuracy of human
perception, but none has quite achieved it yet. In the last couple of decades, the
Convolutional Neural Network (CNN) has emerged as one of the most active areas of research,
and CNN architectures are used to improve accuracy and efficiency in various fields. In this
research, we use CNNs to fuse visible and thermal camera images and to detect persons present
in those images for a reliable surveillance application. Various image fusion methods exist
for multi-sensor, multi-modal, multi-focus, and multi-view fusion. Our proposed methodology
uses an encoder-decoder architecture to fuse the visible and thermal images and a ResNet-152
architecture to classify the fused images. The KAIST multi-spectral dataset, consisting of
95,000 visible and thermal images, is used to train the CNNs. During experimentation, we
observe that the fused architecture outperforms the individual visible-only and thermal-only
architectures, achieving 99.2% accuracy versus 99.01% for visible and 98.98% for thermal.
Images obtained from ResNet-152 are then fed into Mask R-CNN, which uses a ResNet-101
backbone, for person localization. The results clearly show that the fused model for object
localization outperforms the visible-only model and gives promising results for person
detection in surveillance applications. Our proposed localization module achieves a miss rate
of 5.25%, which is 5 percent better than the best previously proposed techniques.