Abstract:
Automatic image captioning is an active and highly challenging research problem in computer vision that aims to understand and describe the contents of a scene in human-understandable language. Existing solutions for image captioning are based on holistic approaches, where the whole image is described at once, potentially losing important aspects of the scene. To enable more detailed captioning, we propose Dense CaptionNet, a deep region-based modular image captioning architecture that extracts and describes each region of the image individually to include more details of the scene in the overall caption. The proposed architecture consists of three main modules. The first generates region descriptions that include not only objects but also object relationships, and the second generates attributes related to those objects. The textual descriptions produced by these two modules are fused into a text file and provided as input to the sentence generation module, an encoder-decoder framework that merges them into a single meaningful, grammatically correct sentence detailed enough to describe the whole scene. The proposed architecture is trained on the Visual Genome, IAPR TC-12, and MSCOCO datasets and tested on an unseen split of the IAPR TC-12 dataset because of the detailed nature of its descriptions. The trained architecture outperforms existing state-of-the-art techniques, e.g., Neural Talk and Show, Attend and Tell, on standard evaluation metrics, especially for complex scenes.
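To make the three-module pipeline summarized above concrete, the following is a minimal Python sketch of the data flow only. The module names, signatures, and dummy outputs are illustrative assumptions, not the authors' implementation; each function stands in for a trained sub-network.

```python
# Hypothetical sketch of the Dense CaptionNet pipeline described in the abstract.
# Function names, signatures, and dummy outputs are assumptions for illustration.

from typing import List


def describe_regions(image) -> List[str]:
    # Assumed region-description module: describes each image region,
    # including object relationships (module 1 in the abstract).
    return ["a man riding a brown horse", "a horse on a grassy field"]


def predict_attributes(image) -> List[str]:
    # Assumed attribute module: predicts attributes of the detected
    # objects (module 2 in the abstract).
    return ["man: young, smiling", "horse: brown, large"]


def generate_sentence(fused_text: str) -> str:
    # Assumed encoder-decoder sentence-generation module: merges the fused
    # region descriptions and attributes into one grammatical sentence.
    return "A young, smiling man rides a large brown horse across a grassy field."


def dense_captionnet(image) -> str:
    # The abstract states the two text streams are fused (written to a text
    # file) before being passed to the sentence-generation module.
    fused = "\n".join(describe_regions(image) + predict_attributes(image))
    return generate_sentence(fused)


if __name__ == "__main__":
    print(dense_captionnet(image=None))  # placeholder input for illustration
```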