Comprehension of Multilingual Expressions Referring to Target Objects in Visual Inputs

📅 2025-11-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity of research on multilingual referring expression comprehension (REC), this paper introduces the first unified REC dataset covering ten languages and proposes a cross-lingual visual grounding framework based on attention-anchored localization. The framework employs a multilingual SigLIP2 encoder to generate coarse-grained spatial anchors, refined via residual learning for improved localization accuracy; it further integrates context-enhanced translation with machine translation to ensure cross-lingual semantic consistency. Evaluated on the multilingual RefCOCO benchmark, the model achieves 86.9% IoU@50 accuracy, substantially outperforming existing methods. This work provides the first systematic empirical validation of the cross-lingual transferability of vision–language alignment in multilingual REC, demonstrating that alignment learned from high-resource languages generalizes effectively to low-resource ones. It establishes a scalable technical pathway for visual grounding in low-resource languages, advancing both multilingual vision–language understanding and practical deployment across diverse linguistic settings.

📝 Abstract
Referring Expression Comprehension (REC) requires models to localize objects in images based on natural language descriptions. Research in this area remains predominantly English-centric, despite increasing global deployment demands. This work addresses multilingual REC through two main contributions. First, we construct a unified multilingual dataset spanning 10 languages, by systematically expanding 12 existing English REC benchmarks through machine translation and context-based translation enhancement. The resulting dataset comprises approximately 8 million multilingual referring expressions across 177,620 images, with 336,882 annotated objects. Second, we introduce an attention-anchored neural architecture that uses multilingual SigLIP2 encoders. Our attention-based approach generates coarse spatial anchors from attention distributions, which are subsequently refined through learned residuals. Experimental evaluation demonstrates competitive performance on standard benchmarks, e.g., achieving 86.9% accuracy at IoU@50 on RefCOCO aggregate multilingual evaluation, compared to an English-only result of 91.3%. Multilingual evaluation shows consistent capabilities across languages, establishing the practical feasibility of multilingual visual grounding systems. The dataset and model are available at multilingual.franreno.com.
Problem

Research questions and friction points this paper is trying to address.

Addressing multilingual referring expression comprehension beyond English-centric approaches
Building unified multilingual dataset by translating English benchmarks across 10 languages
Developing attention-anchored architecture for cross-lingual object localization in images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructed unified multilingual dataset through translation enhancement
Introduced attention-anchored neural architecture with multilingual encoders
Generated spatial anchors from attention distributions with residual refinement
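The anchor-plus-residual idea above can be sketched as follows. This is a minimal, hypothetical illustration (not the paper's actual architecture): a coarse box center is taken as the attention-weighted centroid of normalized patch coordinates, and a small MLP on a pooled text–image feature predicts a residual correction and a box size. All module names, dimensions, and the grid size are assumptions for the sketch.

```python
import torch
import torch.nn as nn


class AttentionAnchoredHead(nn.Module):
    """Hypothetical sketch of attention-anchored localization:
    the coarse anchor is the attention-weighted centroid of patch
    positions; a small MLP refines it via a learned residual."""

    def __init__(self, grid: int = 14, dim: int = 64):
        super().__init__()
        ys, xs = torch.meshgrid(
            torch.linspace(0, 1, grid),
            torch.linspace(0, 1, grid),
            indexing="ij",
        )
        # (grid*grid, 2) normalized (x, y) patch-center coordinates
        self.register_buffer("coords", torch.stack([xs, ys], -1).view(-1, 2))
        # residual head over (dx, dy, w, h); sizes are illustrative
        self.refine = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 4))

    def forward(self, attn: torch.Tensor, pooled: torch.Tensor) -> torch.Tensor:
        # attn: (B, grid*grid) text-to-patch attention logits
        # pooled: (B, dim) pooled multimodal feature (assumed input)
        w = attn.softmax(dim=-1)
        anchor_xy = w @ self.coords            # coarse anchor, (B, 2)
        res = self.refine(pooled)              # learned residual, (B, 4)
        cx_cy = anchor_xy + res[:, :2]         # refined box center
        wh = res[:, 2:].sigmoid()              # positive box size
        return torch.cat([cx_cy, wh], dim=-1)  # (B, 4) box in [0, 1] coords
```

Design note: keeping the anchor differentiable (a softmax-weighted centroid rather than an argmax) lets the residual head learn small corrections on top of a sensible initialization, which is the intuition behind anchoring localization in the attention map.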
Francisco Reis Nogueira
Instituto Superior Técnico

Alexandre Bernardino
Institute for Systems and Robotics (ISR/IST), LARSyS, Instituto Superior Técnico, Univ Lisboa
Computer Vision · Robotics

Bruno Emanuel da Graça Martins
Instituto Superior Técnico