SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study asks whether the spatial numerical outputs that current vision-language models generate in embodied environments, such as coordinates or action magnitudes, genuinely reflect spatial perception. To investigate, the authors propose SpaceNum, a unified framework that evaluates spatial numerical understanding through two bidirectional tasks, Num2Space and Space2Num, under both dynamic exploration and static layout settings. Through error analysis, reasoning trace inspection, and controlled interventions, they find that current models often perform close to random guessing, relying on shallow spatial cues rather than stable coordinate-aware representations or structured spatial abstraction. Explicit reasoning yields only marginal gains, while tuning partially improves spatial numerical understanding and transfers to external spatial reasoning benchmarks. The work highlights limitations in how vision-language models ground numerical outputs in spatial meaning.

📝 Abstract

Vision-Language Models (VLMs) are increasingly deployed in embodied environments, where they need to produce numerical outputs such as action magnitudes and spatial coordinates. Although these numbers appear meaningful, it remains unclear whether these numerical outputs are genuinely grounded in spatial perception. Therefore, in this work, we revisit spatial numerical understanding through SpaceNum, a unified framework that captures two complementary settings: numbers as dynamic transitions during spatial exploration, and numbers as static layouts in spatial reasoning. We formulate two bidirectional tasks, Num2Space and Space2Num, to evaluate how well VLMs map between vision-side spatial structure and language-side numerical representations. We systematically study whether current VLMs truly understand numerical values in spatial settings. Across dynamic transitions and static layouts, we find that models largely fail to ground numbers in spatial meaning and often perform close to random guessing. Through error analysis, reasoning trace analysis, and controlled interventions, we show that current VLMs rely heavily on shallow spatial cues, struggle to build stable coordinate-aware representations, and fail to abstract structured spatial layouts from visual observations. We further show that explicit reasoning provides only marginal gains, while tuning can partially improve spatial numerical understanding and transfer to external spatial reasoning benchmarks.
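The finding that models "perform close to random guessing" can be made concrete with a small sketch. The code below is illustrative only and is not the SpaceNum evaluation code: it assumes, hypothetically, that each item is a multiple-choice question with a fixed number of options and integer option indices, which the abstract does not specify. All function names are made up for this example.

```python
import random


def accuracy(predictions, answers):
    """Fraction of items where the predicted option index matches the answer."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one item is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def chance_accuracy(num_options):
    """Expected accuracy of uniform random guessing over num_options choices."""
    if num_options < 1:
        raise ValueError("num_options must be positive")
    return 1.0 / num_options


def simulated_random_accuracy(answers, num_options, trials=2000, seed=0):
    """Monte Carlo estimate of a uniform random guesser's mean accuracy."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        guesses = [rng.randrange(num_options) for _ in answers]
        total += accuracy(guesses, answers)
    return total / trials


def margin_over_chance(predictions, answers, num_options):
    """Model accuracy minus the chance baseline (assumed multiple-choice format)."""
    return accuracy(predictions, answers) - chance_accuracy(num_options)
```

Under these assumptions, a margin near zero means the model's numerical answers carry little more information than guessing. For tasks with continuous outputs such as raw coordinates, a different baseline (for example, error against randomly sampled positions) would be needed.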

Problem

Research questions and friction points this paper is trying to address.

spatial numerical understanding

Vision-Language Models

numerical grounding

spatial reasoning

embodied AI

Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial numerical understanding

Vision-Language Models

SpaceNum