Abstract:
Image captioning is a rapidly progressing field within artificial intelligence, and it holds
considerable potential for assisting people who are visually impaired. The field has matured
over the past few years, and many implementations can now adequately caption an image.
Another concept that has gained traction in recent years is Zero-Shot Learning, which aims
to bridge the shortcomings of training datasets by identifying unknown classes through
semantic concepts present in the data. A major problem in image captioning is the limited
amount of training data available. The only dataset widely considered suitable for the task,
Microsoft Common Objects in Context (MSCOCO), contains about 120,000 training images
covering roughly 80 object classes, which is insufficient if these techniques are to be
targeted at real-life use. To overcome this problem, we propose a solution that incorporates
Zero-Shot Learning concepts to identify unknown objects and classes using semantic word
embeddings and existing state-of-the-art object identification algorithms, making the image
captioning algorithm more robust and better tailored to real-life use.
Our proposed model, Image Captioning using Novel Word Injection, uses a pre-trained
caption generator and operates on the generator's output to inject objects that are not
present in the dataset into the caption. We report a 74% positive correction ratio over the
captions generated by the underlying generator, where the ratio is the proportion of changes
in which an object was correctly identified and injected into the caption.
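As a rough illustration of the injection idea described above (a minimal sketch, not the paper's exact implementation), the snippet below takes a caption emitted by a generator and a novel object label reported by a separate detector, finds the caption word closest to that label in word-embedding space via cosine similarity, and swaps it in. The toy embedding table, the similarity threshold, and the function names are hypothetical placeholders; a real system would use pre-trained embeddings such as GloVe or word2vec.

```python
import numpy as np

# Hypothetical toy embeddings for illustration only; a real system would
# load pre-trained word vectors (assumption, not the paper's setup).
EMBEDDINGS = {
    "dog":    np.array([0.9, 0.1, 0.0]),
    "cat":    np.array([0.8, 0.2, 0.1]),
    "zebra":  np.array([0.7, 0.6, 0.2]),
    "horse":  np.array([0.7, 0.5, 0.3]),
    "street": np.array([0.0, 0.1, 0.9]),
}

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def inject_novel_word(caption, novel_object, threshold=0.8):
    """Replace the caption word most similar to the detected novel object.

    `threshold` is a hypothetical cutoff: if no caption word is close
    enough in embedding space, the caption is left unchanged.
    """
    novel_vec = EMBEDDINGS[novel_object]
    words = caption.split()
    best_idx, best_sim = None, threshold
    for i, word in enumerate(words):
        if word in EMBEDDINGS and word != novel_object:
            sim = cosine(EMBEDDINGS[word], novel_vec)
            if sim > best_sim:
                best_idx, best_sim = i, sim
    if best_idx is not None:
        words[best_idx] = novel_object
    return " ".join(words)

# Example: the generator never saw "zebra" during training, so it emitted
# "horse"; an object detector (not shown) reports a zebra in the image.
print(inject_novel_word("a horse standing on a street", "zebra"))
# -> "a zebra standing on a street"
```

In this sketch, "horse" is the caption word nearest to "zebra" in the toy embedding space, so it is the one replaced, while unrelated words such as "street" fall below the threshold and are left alone.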