🤖 AI Summary
This work identifies and names the “Cartesian shortcut” in evaluations of current multimodal large language models (MLLMs): visual reasoning benchmarks are built largely on orthogonal grid layouts that can be transcribed into textual coordinates, so models can solve them through text-based deduction rather than by genuinely understanding visual spatial structure. To evaluate topology-invariant spatial reasoning systematically, the authors introduce Polaris-Bench, a benchmark of 53 visual reasoning tasks, each formulated in polar coordinates and paired with a logically equivalent Cartesian counterpart. In an evaluation of 14 state-of-the-art MLLMs, frontier models that reach 70%–83% accuracy on Cartesian layouts fall to 31%–39% on the polar equivalents, exposing a failure to generalize spatial reasoning beyond coordinate-specific heuristics.
📝 Abstract
As current Multimodal Large Language Models rapidly saturate canonical visual reasoning benchmarks, a key question emerges: do these strong scores genuinely reflect robust visual understanding? We identify a pervasive vulnerability, the Cartesian Shortcut: visual reasoning benchmarks prevalently build on orthogonal grid-based layouts that can be readily discretized into explicit textual coordinates. Models systematically exploit this property, heavily leveraging text-based deductive reasoning to assist visual problem-solving. To systematically dismantle this shortcut, we introduce Polaris-Bench, which re-formulates 53 visual reasoning tasks in Polar coordinate space with paired Cartesian counterparts as reference, while preserving consistent logical constraints and task semantics, thus fundamentally breaking the orthogonal prior that models exploit. Comprehensive evaluation across 14 state-of-the-art MLLMs reveals that frontier models achieving 70–83% on Cartesian layouts collapse to 31–39% on Polar equivalents, with degradation persisting even under complete logical equivalence. Moreover, reasoning gains observed on Cartesian layouts are severely diminished on Polar equivalents. These findings expose a critical deficiency in current MLLMs: the lack of topology-invariant visual reasoning.
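To make the idea of a paired layout concrete, the following is a minimal illustrative sketch, not the authors' construction. It assumes that a logical cell (row, col) is drawn as a square in the Cartesian version and as a (ring, sector) wedge in the polar version. The function names and the parameters cell, inner_radius, and ring_width are hypothetical and do not come from the paper.

```python
import math


def cartesian_center(row, col, cell=1.0):
    """Center (x, y) of logical cell (row, col) in an orthogonal grid layout."""
    return ((col + 0.5) * cell, (row + 0.5) * cell)


def polar_center(row, col, n_cols, inner_radius=1.0, ring_width=1.0):
    """Center (x, y) of the same logical cell drawn as ring=row, sector=col.

    Assumption for illustration only: rows become concentric rings and
    columns become equal angular sectors around the origin.
    """
    radius = inner_radius + (row + 0.5) * ring_width
    angle = 2 * math.pi * (col + 0.5) / n_cols
    return (radius * math.cos(angle), radius * math.sin(angle))


# The same logical cell has axis-aligned coordinates in one layout
# and curved, non-orthogonal coordinates in the other.
logical_cell = (2, 1)
print(cartesian_center(*logical_cell))
print(polar_center(*logical_cell, n_cols=4))
```

Under this mapping the puzzle's logical indices are unchanged, but cell positions no longer line up along horizontal and vertical axes, so reading them off as row and column text is less direct. Note that a full circle makes the first and last sectors visually adjacent, which an orthogonal grid does not do; how Polaris-Bench handles that is not stated in the abstract.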