Comprehension of Multilingual Expressions Referring to Target Objects in Visual Inputs

📅 2025-11-14

📈 Citations: 0

✨ Influential: 0

career value

132K/year

🤖 AI Summary

To address the scarcity of research on multilingual referring expression comprehension (REC), this paper introduces the first unified REC dataset covering ten languages and proposes a cross-lingual visual grounding framework based on attention-anchored localization. The framework employs a multilingual SigLIP encoder to generate coarse-grained spatial anchors, refined via residual learning for improved localization accuracy; it further integrates context-enhanced translation with machine translation to ensure cross-lingual semantic consistency. Evaluated on the multilingual RefCOCO benchmark, the model achieves 86.9% IoU@50 accuracy, substantially outperforming existing methods. This work provides the first systematic empirical validation of cross-lingual transferability of vision–language alignment in multilingual REC, demonstrating that alignment learned from high-resource languages generalizes effectively to low-resource ones. It establishes a scalable technical pathway for visual grounding in low-resource languages, advancing both multilingual vision–language understanding and practical deployment across diverse linguistic settings.

Technology Category

Application Category

📝 Abstract

Referring Expression Comprehension (REC) requires models to localize objects in images based on natural language descriptions. Research on the area remains predominantly English-centric, despite increasing global deployment demands. This work addresses multilingual REC through two main contributions. First, we construct a unified multilingual dataset spanning 10 languages, by systematically expanding 12 existing English REC benchmarks through machine translation and context-based translation enhancement. The resulting dataset comprises approximately 8 million multilingual referring expressions across 177,620 images, with 336,882 annotated objects. Second, we introduce an attention-anchored neural architecture that uses multilingual SigLIP2 encoders. Our attention-based approach generates coarse spatial anchors from attention distributions, which are subsequently refined through learned residuals. Experimental evaluation demonstrates competitive performance on standard benchmarks, e.g. achieving 86.9% accuracy at IoU@50 on RefCOCO aggregate multilingual evaluation, compared to an English-only result of 91.3%. Multilingual evaluation shows consistent capabilities across languages, establishing the practical feasibility of multilingual visual grounding systems. The dataset and model are available at $href{https://multilingual.franreno.com}{multilingual.franreno.com}$.

Problem

Research questions and friction points this paper is trying to address.

Addressing multilingual referring expression comprehension beyond English-centric approaches

Building unified multilingual dataset by translating English benchmarks across 10 languages

Developing attention-anchored architecture for cross-lingual object localization in images

Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructed unified multilingual dataset through translation enhancement

Introduced attention-anchored neural architecture with multilingual encoders

Generated spatial anchors from attention distributions with residual refinement

🔎 Similar Papers

Computer Vision Datasets and Models Exhibit Cultural and Linguistic Diversity in Perception