Abstract:
Internet technologies are generating enormous amounts of data that merge textual and visual
content: tagged images, descriptions in newspapers, videos with captions, and social media
feeds. Such interaction with technology and devices has become part of everyday life, for
example explaining an image in the context of news, following instructions by interpreting
a diagram or a map, or understanding presentations while listening to a lecture. Traditionally,
content providers manually added captions to make such content more accessible. These captions
are used by text-to-speech systems to generate natural-language descriptions of images and
videos. Recent years have seen an upsurge of interest in problems that combine language and
visual content, in particular the development of methods for automatically generating image
descriptions.
Due to its potential applications in computer vision, information retrieval, autonomous vehicles,
and natural language processing (NLP), the automatic generation of a sequence of words
describing an image, known as a caption, has attracted considerable attention over the past
decade. Various techniques have been proposed that generate image descriptions from the most
suitable annotations in the training set; these training annotations are sometimes rearranged
or augmented by NLP algorithms. Despite significant achievements in generating sentences for
images, existing models struggle to capture human-like semantics in the generated descriptions.
In this thesis, three novel image description techniques are proposed to generate semantically
superior captions for the target image. The first proposed technique incorporates
topic-sensitive word embeddings for the generation of image descriptions. Topic models associate
documents with different topics, each defined by a probability distribution over
words. The proposed approach uses topic modeling to align the semantic meaning of words
with image features and to generate descriptions that are more relevant to the context (topic) of the
target image regions. Compared to traditional models, the proposed approach exploits high-level
word semantics to represent the diversity of the training corpus.
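To make the topic-modeling step concrete, the following is a minimal illustrative Python sketch
(not taken from the thesis; the caption texts and parameter values are hypothetical). It shows
how a corpus of reference captions can be mapped to per-caption topic distributions, which could
then be used to bias word representations toward a target image's dominant topic.

# Illustrative sketch only: topic distributions over a toy caption corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

captions = [                                   # hypothetical training captions
    "a dog runs along the sandy beach",
    "a man rides a surfboard on a large wave",
    "a plate of pasta on a wooden table",
    "a bowl of salad next to a cup of coffee",
]
bow = CountVectorizer().fit_transform(captions)            # bag-of-words counts
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(bow)

topic_dist = lda.transform(bow)                # per-caption topic probabilities
print(topic_dist.round(2))                     # each row sums to 1 over the 2 topics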
The convolutional layers of the visual encoder used in traditional models generate feature maps
that extract hierarchical information from the visual content. These convolutional layers do
not exploit the dependencies between feature maps, which can result in the loss of essential
information needed to guide the language model during description generation. The second proposed
model incorporates scene information, which captures the overall setting reflected in the visual
content, together with object-level features refined by a squeeze-and-excitation module and
spatial details, to boost the accuracy of caption generation. Visual features are coupled with
location information and with topic modeling, which captures semantic word relationships, to feed
the sequence-to-sequence word generation task.
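For illustration only, the sketch below shows a standard squeeze-and-excitation block in PyTorch,
of the kind referred to above; the channel count and reduction ratio are assumed values, not
parameters taken from the thesis.

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel attention: squeeze spatial dimensions, then excite per-channel weights."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: global average pooling
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                              # per-channel gate in [0, 1]
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                   # re-weight the feature maps

feats = torch.randn(2, 512, 14, 14)                    # e.g. CNN encoder feature maps
print(SEBlock(512)(feats).shape)                       # torch.Size([2, 512, 14, 14])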
The third proposed approach addresses the challenges of remote sensing image description
caused by the large variance in the visual appearance of objects. A multi-scale visual feature
encoder is proposed to extract detailed information from remote sensing images. An adaptive
attention decoder dynamically assigns weights to the multi-scale features and textual cues,
strengthening the language model to generate novel topic-sensitive descriptions.
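As an illustrative sketch under stated assumptions (dot-product scoring and a single sentinel
vector standing in for the textual cues; neither choice is specified in this abstract), the
following shows how a decoder state might softly weight multi-scale visual features against
textual context.

import torch
import torch.nn.functional as F

def adaptive_attention(query, visual_feats, sentinel):
    """Softly choose between multi-scale visual regions and a textual sentinel.

    query:        (B, D)    decoder hidden state
    visual_feats: (B, N, D) multi-scale region features
    sentinel:     (B, D)    vector summarising textual context
    """
    cands = torch.cat([visual_feats, sentinel.unsqueeze(1)], dim=1)  # (B, N+1, D)
    scores = torch.bmm(cands, query.unsqueeze(2)).squeeze(2)         # dot-product scores
    alpha = F.softmax(scores, dim=1)                                 # attention weights
    context = torch.bmm(alpha.unsqueeze(1), cands).squeeze(1)        # weighted context vector
    return context, alpha

B, N, D = 2, 3, 8
ctx, alpha = adaptive_attention(torch.randn(B, D), torch.randn(B, N, D), torch.randn(B, D))
print(ctx.shape, alpha.shape)   # torch.Size([2, 8]) torch.Size([2, 4])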