Paper Title
Image Caption Generation with Novel Application Domains
Abstract
Generating a natural language description of an image has attracted considerable interest, both because of its importance in practical applications and because it connects two major fields of artificial intelligence: computer vision and natural language processing. Existing approaches are either top-down, starting from a gist of the image and converting it into words, or bottom-up, producing words that describe various aspects of the image and then combining them. This paper employs a top-down approach through a hybrid system that uses the ResNet50 architecture (a multi-layer Convolutional Neural Network (CNN)) for image feature extraction and a Long Short Term Memory (LSTM) network to structure meaningful sentences. The proposed model is evaluated on the Flickr 8K dataset, where it generates sensible and accurate captions in a majority of cases. Once the base caption generation model is ready, the experiment is extended to two applications: image retrieval and text-to-speech conversion for the visually impaired. For the first, the generated captions are stored in a database, and an image retrieval system uses Pearson's Correlation Coefficient to retrieve images from the database that best match the query caption. For the second, the caption text is converted into speech for the visually impaired using Google's TTS API.
Keywords - Image Captioning, Convolutional Neural Networks (CNN), Residual Neural Network (ResNet), Recurrent Neural Network (RNN), Long Short Term Memory (LSTM).
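As an illustration of the caption-based retrieval idea described in the abstract, the following is a minimal sketch of how stored captions might be ranked against a query caption with Pearson's Correlation Coefficient. The abstract does not state how captions are turned into numeric vectors, so the bag-of-words encoding used here is an assumption, and the function names (`bag_of_words`, `pearson`, `retrieve`) and the example file names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(caption, vocabulary):
    """Encode a caption as word counts over a fixed vocabulary.

    Assumption: the paper does not specify the caption encoding;
    bag-of-words is used here only for illustration.
    """
    counts = Counter(caption.lower().split())
    return [counts[word] for word in vocabulary]


def pearson(x, y):
    """Pearson's Correlation Coefficient between two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # The coefficient is undefined for a constant vector; treat as no correlation.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def retrieve(query, database, top_k=3):
    """Rank images in `database` (image name -> caption) by correlation with `query`."""
    all_captions = list(database.values()) + [query]
    vocabulary = sorted({w for c in all_captions for w in c.lower().split()})
    query_vec = bag_of_words(query, vocabulary)
    scored = [
        (pearson(query_vec, bag_of_words(caption, vocabulary)), name)
        for name, caption in database.items()
    ]
    scored.sort(reverse=True)  # highest correlation first
    return [(name, score) for score, name in scored[:top_k]]


# Hypothetical caption database, standing in for the stored generated captions.
captions_db = {
    "img1.jpg": "a dog runs on the grass",
    "img2.jpg": "a man rides a bike",
    "img3.jpg": "two children play in the snow",
}
results = retrieve("a dog runs on the grass", captions_db, top_k=2)
```

In this sketch, a query identical to a stored caption yields a coefficient of 1 for that image, while captions sharing few words with the query score lower. Other encodings (for example, learned sentence embeddings) could be substituted without changing the ranking step.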