🤖 AI Summary
To address the scarcity of research on multilingual referring expression comprehension (REC), this paper introduces the first unified REC dataset covering ten languages and proposes a cross-lingual visual grounding framework based on attention-anchored localization. The framework employs a multilingual SigLIP encoder to generate coarse-grained spatial anchors, refined via residual learning for improved localization accuracy; it further integrates context-enhanced translation with machine translation to ensure cross-lingual semantic consistency. Evaluated on the multilingual RefCOCO benchmark, the model achieves 86.9% IoU@50 accuracy, substantially outperforming existing methods. This work provides the first systematic empirical validation of cross-lingual transferability of vision–language alignment in multilingual REC, demonstrating that alignment learned from high-resource languages generalizes effectively to low-resource ones. It establishes a scalable technical pathway for visual grounding in low-resource languages, advancing both multilingual vision–language understanding and practical deployment across diverse linguistic settings.
📝 Abstract
Referring Expression Comprehension (REC) requires models to localize objects in images based on natural language descriptions. Research on the area remains predominantly English-centric, despite increasing global deployment demands. This work addresses multilingual REC through two main contributions. First, we construct a unified multilingual dataset spanning 10 languages, by systematically expanding 12 existing English REC benchmarks through machine translation and context-based translation enhancement. The resulting dataset comprises approximately 8 million multilingual referring expressions across 177,620 images, with 336,882 annotated objects. Second, we introduce an attention-anchored neural architecture that uses multilingual SigLIP2 encoders. Our attention-based approach generates coarse spatial anchors from attention distributions, which are subsequently refined through learned residuals. Experimental evaluation demonstrates competitive performance on standard benchmarks, e.g. achieving 86.9% accuracy at IoU@50 on RefCOCO aggregate multilingual evaluation, compared to an English-only result of 91.3%. Multilingual evaluation shows consistent capabilities across languages, establishing the practical feasibility of multilingual visual grounding systems. The dataset and model are available at $href{https://multilingual.franreno.com}{multilingual.franreno.com}$.