Abstract:
Internet technologies are generating enormous amounts of data that merge textual and visual
content: tagged images, descriptions in newspapers, videos with captions, and social media
feeds. Such interaction with technology and devices has become part of everyday life, for
example explaining an image in the context of news, following instructions by interpreting
a diagram or a map, or understanding presentations while listening to a lecture. Traditionally,
content providers manually added captions to make such content more accessible. These captions
are used by text-to-speech systems to generate natural-language descriptions of images and
videos. Recent years have seen an upsurge of interest in problems that combine language and
visual content, in particular the development of methods for automatically generating image
descriptions.
Due to its potential applications in computer vision, information retrieval, autonomous vehicles,
and natural language processing (NLP), the automatic generation of a sequence of words
describing an image, known as a caption, has attracted considerable attention over the past
decade. Various techniques have been proposed that generate image descriptions from the most
suitable annotations in the training set; these training annotations are sometimes rearranged
or augmented by NLP algorithms. Despite significant achievements in generating sentences for
images, existing models struggle to capture human-like semantics in the generated descriptions.
In this thesis, three novel image description techniques are proposed to generate semantically
superior captions for the target image. The first proposed technique incorporates
topic-sensitive word embeddings for the generation of image descriptions. Topic models associate
documents with different topics, each defined by a probability distribution over
words. The proposed approach uses topic modeling to align the semantic meaning of words
with image features and to generate descriptions that are more relevant to the context (topic) of the
target image regions. Compared to traditional models, the proposed approach exploits high-level
word semantics to represent the diversity of the training corpus.
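To make the topic-modeling step concrete, the following is a minimal illustrative Python sketch
(not taken from the thesis; the caption texts and parameter values are hypothetical). It shows
how a corpus of reference captions can be mapped to per-caption topic distributions, which could
then be used to bias word representations toward a target image's dominant topic.

# Illustrative sketch only: topic distributions over a toy caption corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

captions = [                                   # hypothetical training captions
    "a dog runs along the sandy beach",
    "a man rides a surfboard on a large wave",
    "a plate of pasta on a wooden table",
    "a bowl of salad next to a cup of coffee",
]
bow = CountVectorizer().fit_transform(captions)            # bag-of-words counts
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(bow)

topic_dist = lda.transform(bow)                # per-caption topic probabilities
print(topic_dist.round(2))                     # each row sums to 1 over the 2 topics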
The convolutional layers of the visual encoder used in traditional models generate feature maps
that extract hierarchical information from the visual content. These convolutional layers do
not exploit the dependencies between feature maps, which can result in the loss of essential
information needed to guide the language model during description generation. The second proposed
model incorporates scene information, which captures the overall setting reflected in the visual
content, together with object-level features refined by a squeeze-and-excitation module and
spatial details, to boost the accuracy of caption generation. Visual features are coupled with
location information and with topic modeling, which captures semantic word relationships, to feed
the sequence-to-sequence word generation task.
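For illustration only, the sketch below shows a standard squeeze-and-excitation block in PyTorch,
of the kind referred to above; the channel count and reduction ratio are assumed values, not
parameters taken from the thesis.

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel attention: squeeze spatial dimensions, then excite per-channel weights."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: global average pooling
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                              # per-channel gate in [0, 1]
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                   # re-weight the feature maps

feats = torch.randn(2, 512, 14, 14)                    # e.g. CNN encoder feature maps
print(SEBlock(512)(feats).shape)                       # torch.Size([2, 512, 14, 14])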
The third proposed approach addresses the challenges of remote sensing image description
caused by the large variance in the visual appearance of objects. A multi-scale visual feature
encoder is proposed to extract detailed information from remote sensing images. An adaptive
attention decoder dynamically assigns weights to the multi-scale features and textual cues,
strengthening the language model to generate novel topic-sensitive descriptions.
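As an illustrative sketch under stated assumptions (dot-product scoring and a single sentinel
vector standing in for the textual cues; neither choice is specified in this abstract), the
following shows how a decoder state might softly weight multi-scale visual features against
textual context.

import torch
import torch.nn.functional as F

def adaptive_attention(query, visual_feats, sentinel):
    """Softly choose between multi-scale visual regions and a textual sentinel.

    query:        (B, D)    decoder hidden state
    visual_feats: (B, N, D) multi-scale region features
    sentinel:     (B, D)    vector summarising textual context
    """
    cands = torch.cat([visual_feats, sentinel.unsqueeze(1)], dim=1)  # (B, N+1, D)
    scores = torch.bmm(cands, query.unsqueeze(2)).squeeze(2)         # dot-product scores
    alpha = F.softmax(scores, dim=1)                                 # attention weights
    context = torch.bmm(alpha.unsqueeze(1), cands).squeeze(1)        # weighted context vector
    return context, alpha

B, N, D = 2, 3, 8
ctx, alpha = adaptive_attention(torch.randn(B, D), torch.randn(B, N, D), torch.randn(B, D))
print(ctx.shape, alpha.shape)   # torch.Size([2, 8]) torch.Size([2, 4])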