Align Where the Words Look: Cross-Attention-Guided Patch Alignment with Contrastive and Transport Regularization for Bengali Captioning

๐Ÿ“… 2025-09-22
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
To address the scarcity of paired image-text data, translation-induced misalignment in pivot-based approaches, and the weakened semantic representation of low-resource languages (e.g., Bengali) under English-centric pretraining in vision-language modeling, this paper proposes a fine-grained, alignment-enhanced image captioning method. Methodologically, it introduces (1) a cross-attention-guided Patch Alignment Loss (PAL), jointly optimized with InfoNCE global contrastive learning and Sinkhorn optimal-transport regularization; (2) a frozen MaxViT for robust visual patch extraction, coupled with a native mBART-50 decoder and a lightweight bridging module for efficient cross-modal fusion; and (3) training on LaBSE-validated English-Bengali image-text pairs augmented with 110K images synthesized from bilingual prompts. On Flickr30k-1k and MSCOCO-1k, the model achieves BLEU-4 scores of 12.29/12.00 and BERTScore-F1 of 71.20/75.40, while reducing the inter-class center distance between real and synthetic data by 41%.
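The global contrastive term described above can be sketched as a standard InfoNCE loss over pooled image embeddings, treating each real image's synthetic counterpart as its positive and all other pairs in the batch as negatives. This is an illustrative reconstruction, not the paper's exact formulation; the temperature value and pairing scheme here are assumptions:

```python
import numpy as np

def info_nce(real_emb, synth_emb, temperature=0.07):
    """InfoNCE over L2-normalized global embeddings.

    Each real embedding's positive is its paired synthetic embedding
    (the diagonal of the similarity matrix); all other pairs act as
    negatives. Sketch only: temperature/pairing are assumptions.
    """
    r = real_emb / np.linalg.norm(real_emb, axis=1, keepdims=True)
    s = synth_emb / np.linalg.norm(synth_emb, axis=1, keepdims=True)
    logits = r @ s.T / temperature               # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # positives sit on the diagonal
```

When the two embedding sets are perfectly aligned, the diagonal dominates the softmax and the loss approaches zero; mismatched sets yield a larger loss, which is the separation pressure the summary refers to.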


๐Ÿ“ Abstract
Grounding vision-language models in low-resource languages remains challenging, as they often produce fluent text about the wrong objects. This stems from scarce paired data, translation pivots that break alignment, and English-centric pretraining that ignores target-language semantics. We address this with a compute-aware Bengali captioning pipeline trained on LaBSE-verified EN-BN pairs and 110k bilingual-prompted synthetic images. A frozen MaxViT yields stable visual patches, a Bengali-native mBART-50 decodes, and a lightweight bridge links the modalities. Our core novelty is a tri-loss objective: Patch-Alignment Loss (PAL) aligns real and synthetic patch descriptors using decoder cross-attention, InfoNCE enforces global real-synthetic separation, and Sinkhorn-based OT ensures balanced fine-grained patch correspondence. This PAL+InfoNCE+OT synergy improves grounding, reduces spurious matches, and drives strong gains on Flickr30k-1k (BLEU-4 12.29, METEOR 27.98, BERTScore-F1 71.20) and MSCOCO-1k (BLEU-4 12.00, METEOR 28.14, BERTScore-F1 75.40), outperforming strong CE baselines and narrowing the real-synthetic centroid gap by 41%.
Problem

Research questions and friction points this paper is trying to address.

Grounding vision-language models for the low-resource Bengali language
Aligning visual patches with Bengali text using cross-attention guidance
Reducing spurious matches between images and Bengali captions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Aligns real and synthetic patches using decoder cross-attention guidance
Applies an InfoNCE loss for global real-synthetic separation
Uses Sinkhorn-based optimal transport for balanced patch correspondence
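The Sinkhorn-based transport regularizer listed above can be sketched as entropy-regularized optimal transport between two patch sets: alternately rescaling the rows and columns of a Gibbs kernel drives the transport plan toward uniform marginals, yielding a balanced soft correspondence in which no patch is over- or under-matched. The cost matrix, regularization strength, and iteration count below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def sinkhorn_plan(cost, reg=0.1, n_iters=50):
    """Entropy-regularized OT via Sinkhorn iterations (sketch).

    cost: (n, m) pairwise patch cost matrix.
    Returns a transport plan whose row sums approach 1/n and whose
    column sums approach 1/m (uniform marginals), i.e. a balanced
    soft matching between the two patch sets.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)          # uniform source marginal
    b = np.full(m, 1.0 / m)          # uniform target marginal
    K = np.exp(-cost / reg)          # Gibbs kernel
    u = np.ones(n) / n
    v = np.ones(m) / m
    for _ in range(n_iters):
        u = a / (K @ v)              # scale rows toward marginal a
        v = b / (K.T @ u)            # scale columns toward marginal b
    return u[:, None] * K * v[None, :]
```

In the captioning objective, the cost matrix would come from distances between real and synthetic patch descriptors, and the (negated) entropy-regularized transport cost of this plan serves as the balancing regularizer.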
๐Ÿ”Ž Similar Papers
No similar papers found.