A Comparative Study of Various Deep Learning Models for Image Caption Generation on Indian Historical Data

Abstract
Image captioning follows an encoder-decoder architecture, which poses a challenge for both image analysis and text generation. Owing to the success of attention-based deep learning models in both language translation and image processing, automatic image captioning has received considerable attention. Improving the performance of each part of the framework, or employing a more effective attention mechanism, has a positive impact on the eventual performance. We propose a newly created dataset of historical sites in India, including historical temples, stepwells, carved columns, and sculptures of gods, goddesses, and people. In this research, we use several deep learning encoder-decoder architectures for image captioning: VGG16-LSTM, ResNet50-LSTM, ResNet152-LSTM, InceptionV3-LSTM, EfficientNetB0-LSTM, and the Transformer, where the LSTM serves as a strong decoder. Recent work shows that the Transformer is superior to the LSTM in efficiency and performance on several NLP and captioning tasks. The Transformer consists of multiple encoder-decoder pairs: the encoders represent the image feature vectors using self-attention, extracting important features and allowing all feature vectors to interact with one another to determine where to pay more attention, while the decoders apply multi-head attention to generate the word sequence step by step, conditioned on the contextualized encoding sequence. This study presents a comparative analysis of the performance of the six models implemented on the newly created dataset. The performance measures considered in this study are the BLEU-1 to BLEU-4 scores.
Keywords - Image captioning, Encoder-Decoder, Multi-Head Attention
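
To make the CNN-encoder/LSTM-decoder setup concrete, the following is a minimal sketch (not the authors' implementation) of one compared configuration: pre-extracted VGG16 image features merged with an LSTM text decoder for next-word prediction. The vocabulary size, maximum caption length, and 256-unit widths are illustrative assumptions, not the paper's settings.

```python
# Sketch of a VGG16-LSTM captioning model (merge architecture).
# Assumes 4096-d VGG16 fc2 features have been pre-extracted per image;
# vocab_size, max_len, and 256 units are assumed values for illustration.
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size, max_len = 5000, 34  # illustrative assumptions

# Image branch: project the 4096-d VGG16 feature vector to 256 dimensions
image_input = Input(shape=(4096,))
fe = Dropout(0.5)(image_input)
fe = Dense(256, activation="relu")(fe)

# Text branch: embed the partial caption and summarize it with an LSTM
caption_input = Input(shape=(max_len,))
se = Embedding(vocab_size, 256, mask_zero=True)(caption_input)
se = Dropout(0.5)(se)
se = LSTM(256)(se)

# Merge both modalities and predict a distribution over the next word
decoder = add([fe, se])
decoder = Dense(256, activation="relu")(decoder)
outputs = Dense(vocab_size, activation="softmax")(decoder)

model = Model(inputs=[image_input, caption_input], outputs=outputs)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```

At evaluation time, BLEU-1 through BLEU-4 can be computed, for example, with nltk.translate.bleu_score.corpus_bleu using n-gram weights ranging from (1, 0, 0, 0) to (0.25, 0.25, 0.25, 0.25).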