🤖 AI Summary
To address the scarcity of paired image-text data, translation-induced misalignment in pivot-based approaches, and the weakened semantic representation of low-resource languages (e.g., Bengali) under English-centric pretraining in vision-language modeling, this paper proposes a fine-grained alignment-enhanced image captioning method. Methodologically, it introduces (1) a cross-attention-guided Patch Alignment Loss (PAL), combined with InfoNCE global contrastive learning and Sinkhorn optimal transport regularization in a jointly optimized tri-loss objective; (2) a frozen MaxViT for robust visual patch extraction, coupled with a native mBART-50 decoder and a lightweight bridging module for efficient cross-modal fusion; and (3) training on LaBSE-validated English–Bengali image-text pairs augmented with 110K bilingual-prompt-synthesized images. On Flickr30k-1k and MSCOCO-1k, the model achieves BLEU-4 scores of 12.29/12.00 and BERTScore-F1 of 71.20/75.40, while reducing the inter-class center distance between real and synthetic data by 41%.
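The cross-attention-guided PAL described above can be illustrated with a minimal NumPy sketch: patch features are pooled using the decoder's cross-attention weights, and real and synthetic descriptors are aligned with a squared-distance penalty. All shapes, names, and the L2 form of the penalty here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def pal_descriptors(patches, attn):
    """Pool per-image patch features into one descriptor, weighting each
    patch by the decoder's cross-attention mass over it.
    Assumed shapes: patches (N, P, D), attn (N, P)."""
    w = attn / attn.sum(axis=1, keepdims=True)    # normalize attention weights
    return (w[:, :, None] * patches).sum(axis=1)  # (N, D) pooled descriptors

def pal_loss(real_patches, synth_patches, attn_real, attn_synth):
    """Mean squared distance between attention-pooled real and synthetic
    descriptors -- a minimal stand-in for the Patch Alignment Loss."""
    d_real = pal_descriptors(real_patches, attn_real)
    d_synth = pal_descriptors(synth_patches, attn_synth)
    return np.mean(np.sum((d_real - d_synth) ** 2, axis=1))
```

Pooling with cross-attention (rather than mean pooling) focuses the alignment on the patches the decoder actually attends to when generating the caption.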
📄 Abstract
Grounding vision--language models in low-resource languages remains challenging, as they often produce fluent text about the wrong objects. This stems from scarce paired data, translation pivots that break alignment, and English-centric pretraining that ignores target-language semantics. We address this with a compute-aware Bengali captioning pipeline trained on LaBSE-verified EN--BN pairs and 110K bilingual-prompted synthetic images. A frozen MaxViT yields stable visual patches, a Bengali-native mBART-50 decoder generates captions, and a lightweight bridge links the modalities. Our core novelty is a tri-loss objective: Patch-Alignment Loss (PAL) aligns real and synthetic patch descriptors using decoder cross-attention, InfoNCE enforces global real--synthetic separation, and Sinkhorn-based OT ensures balanced fine-grained patch correspondence. This PAL+InfoNCE+OT synergy improves grounding, reduces spurious matches, and drives strong gains on Flickr30k-1k (BLEU-4 12.29, METEOR 27.98, BERTScore-F1 71.20) and MSCOCO-1k (BLEU-4 12.00, METEOR 28.14, BERTScore-F1 75.40), outperforming strong CE baselines and narrowing the real--synthetic centroid gap by 41%.
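The other two terms of the tri-loss objective are standard components and can be sketched in NumPy: InfoNCE contrasts paired global embeddings against in-batch negatives, and Sinkhorn iteration produces an entropy-regularized transport plan with balanced marginals over patch-to-patch costs. The temperature, regularization strength, and iteration count below are illustrative defaults, not the paper's values.

```python
import numpy as np

def info_nce(z_a, z_b, tau=0.07):
    """Global InfoNCE loss between two sets of embeddings whose rows are
    paired positives; all other rows act as in-batch negatives.
    tau is an assumed temperature, not taken from the paper."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / tau                   # (N, N) cosine similarities
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # positives lie on the diagonal

def sinkhorn_plan(cost, eps=0.1, n_iter=50):
    """Entropy-regularized OT: alternately rescale rows and columns of
    exp(-cost/eps) toward uniform marginals, yielding a balanced
    patch-correspondence plan."""
    n, m = cost.shape
    K = np.exp(-cost / eps)
    r = np.full(n, 1.0 / n)                      # uniform row marginal
    c = np.full(m, 1.0 / m)                      # uniform column marginal
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):
        u = r / (K @ v)
        v = c / (K.T @ u)
    return (u[:, None] * K) * v[None, :]         # transport plan (n, m)
```

The balanced marginals enforced by Sinkhorn prevent a few dominant patches from absorbing all correspondence mass, which is how the OT term discourages spurious many-to-one matches.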