Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities

📅 2024-10-22

🏛️ arXiv.org

📈 Citations: 6

✨ Influential: 0

🤖 AI Summary

This work addresses the vulnerability of vision-language models (VLMs) to ambiguity in spatial frames of reference (FoRs). We introduce COMFORT, the first multilingual, cognitively aligned evaluation framework explicitly designed to assess FoR ambiguity. COMFORT comprises four components: multilingual spatial instruction construction, FoR annotation consistency analysis, cross-model robustness testing, and quantification of language-specific biases. We systematically evaluate nine state-of-the-art VLMs. Results reveal pervasive deficits: limited FoR flexibility, strong English-centrism, severe cross-lingual performance imbalance, low spatial reasoning consistency, and weak cultural adaptability. These findings point to structural deficiencies in how current VLMs model spatial cognition. COMFORT provides both theoretical insights and an empirical benchmark for developing more robust, culturally inclusive multilingual vision-language systems.

📝 Abstract

Spatial expressions in situated communication can be ambiguous, as their meanings vary depending on the frames of reference (FoR) adopted by speakers and listeners. While spatial language understanding and reasoning by vision-language models (VLMs) have gained increasing attention, potential ambiguities in these models are still under-explored. To address this issue, we present the COnsistent Multilingual Frame Of Reference Test (COMFORT), an evaluation protocol to systematically assess the spatial reasoning capabilities of VLMs. We evaluate nine state-of-the-art VLMs using COMFORT. Despite showing some alignment with English conventions in resolving ambiguities, our experiments reveal significant shortcomings of VLMs: notably, the models (1) exhibit poor robustness and consistency, (2) lack the flexibility to accommodate multiple FoRs, and (3) fail to adhere to language-specific or culture-specific conventions in cross-lingual tests, as English tends to dominate other languages. With a growing effort to align vision-language models with human cognitive intuitions, we call for more attention to the ambiguous nature and cross-cultural diversity of spatial reasoning.
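To make the ambiguity concrete, the sketch below shows how the same sentence ("the ball is to the left of the chair") can be true under one frame of reference and false under another. It is an illustrative toy in a 2D top-down scene, not the COMFORT protocol itself; the function name `is_left_of`, the coordinates, and the simplified geometry are all assumptions made for this example.

```python
# Toy illustration of frame-of-reference (FoR) ambiguity in a 2D top-down scene.
# Coordinates: x points right, y points up. All names and values are hypothetical.

def is_left_of(target, reference, forward):
    """Return True if `target` lies on the left side of `reference`,
    where "left" is defined relative to the `forward` direction of the
    chosen frame of reference."""
    fx, fy = forward
    # Rotating the forward vector 90 degrees counterclockwise gives "left".
    left = (-fy, fx)
    dx = target[0] - reference[0]
    dy = target[1] - reference[1]
    # Positive projection onto the left axis means the target is to the left.
    return dx * left[0] + dy * left[1] > 0


chair = (0.0, 0.0)
ball = (-1.0, 0.0)

# Relative (viewer-centered) FoR: the viewer stands at (0, -5) looking toward +y.
viewer_forward = (0.0, 1.0)

# Intrinsic (object-centered) FoR: the chair faces the viewer, i.e. toward -y.
chair_facing = (0.0, -1.0)

relative_answer = is_left_of(ball, chair, viewer_forward)
intrinsic_answer = is_left_of(ball, chair, chair_facing)

print("relative FoR:", relative_answer)
print("intrinsic FoR:", intrinsic_answer)
```

Because the chair faces the viewer, its own left is the viewer's right, so the two frames give opposite answers for the same scene. A model that commits to only one frame, or switches frames inconsistently, will therefore disagree with speakers who adopt the other.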

Problem

Research questions and friction points this paper is trying to address.

Evaluate VLMs' spatial reasoning under ambiguous references

Assess robustness and consistency in spatial frame understanding

Examine cross-lingual FoR adherence beyond English dominance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates VLMs' spatial reasoning with the COMFORT protocol

Tests robustness and consistency across multiple FoRs

Assesses cross-lingual adherence to cultural conventions
