🤖 AI Summary
This study addresses the scarcity of high-quality image captioning approaches for Hindi by proposing the first end-to-end multimodal image description model tailored to this language. The model integrates local and global visual features extracted from VGG16, ResNet50, and Inception V3, and employs a bidirectional LSTM with an attention mechanism to generate semantically accurate and fluent Hindi captions. Evaluated on the Flickr8k dataset, the proposed approach achieves BLEU-1 and BLEU-4 scores of 0.59 and 0.19, respectively, demonstrating its effectiveness in bridging the gap in non-English multimodal generation research.
📝 Abstract
Automated image captioning, generating a description from the content of an image, is very appealing when done by harnessing the combined capabilities of computer vision and natural language processing. Extensive research has been done in this field with a major focus on the English language, which leaves scope for further development in other widely spoken languages. This research utilizes distinct models for generating image captions in Hindi, the fourth most spoken language in the world. Exploring multi-modal architectures, this research combines local visual features, global visual features, attention mechanisms, and pre-trained models. Hindi image descriptions were generated by applying Google Cloud Translator to the captions of the Flickr8k image dataset. Pre-trained CNNs such as VGG16, ResNet50, and Inception V3 were used to extract image features, while both uni-directional and bi-directional LSTMs were used for text encoding. An additional attention layer generates a weight vector and, by multiplying it with the image features at each time step, combines them into a sentence-level feature vector. Bilingual Evaluation Understudy (BLEU) scores are used to compare the research outcomes, and several baseline experiments were run for comparative analysis. A caption with a good BLEU-1 score is considered adequate, whereas a good BLEU-4 score indicates fluent captioning. For both BLEU metrics, the attention-based bidirectional LSTM with VGG16 produced the best results, 0.59 and 0.19 respectively. The experiments demonstrate the model's ability to produce relevant, semantically accurate image captions in Hindi. The research accomplishes its goals, and future work can be guided by this model.
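The attention layer described above, which turns scores into a weight vector and multiplies it against per-time-step features to get a sentence-level vector, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the scoring vector `query_w` and all shapes are hypothetical stand-ins for the learned attention parameters.

```python
import numpy as np

def attention_pool(features, query_w):
    """Combine per-time-step feature vectors into one sentence-level
    vector via a softmax-weighted sum (a generic attention pooling,
    assumed to resemble the paper's attention layer).

    features : (T, D) array, one D-dim feature vector per time step
    query_w  : (D,) scoring vector (hypothetical learned parameter)
    """
    scores = features @ query_w                       # (T,) unnormalized scores
    scores = scores - scores.max()                    # subtract max for stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax -> weight vector
    return weights @ features                         # (D,) weighted combination

# Toy example: 10 time steps of 512-dim features
rng = np.random.default_rng(0)
T, D = 10, 512
feats = rng.standard_normal((T, D))
w = rng.standard_normal(D)
pooled = attention_pool(feats, w)
print(pooled.shape)  # (512,)
```

In a full captioning model, `features` would come from the LSTM hidden states over the caption time steps (or from CNN feature maps), and the attention parameters would be trained jointly with the rest of the network.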