Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of multimodal large language models (MLLMs) in embodied spatial intelligence: their susceptibility to the “Cartesian illusion,” which impedes second-order theory-of-mind reasoning in multi-agent settings. The authors introduce an audio-visual task in which agent A must predict agent B’s estimate of A’s relative location, given B’s perceptual constraints such as orientation and sensory bottlenecks. Methodologically, they propose an Epistemic Sensory Bottleneck module combined with an Anchor-Based Embodied Spatial Decomposition chain-of-thought that avoids rigid, rule-based coordinate transformations. The chain first establishes B’s local coordinate system, then dynamically weights visual and auditory cues according to whether A falls within B’s visual frustum. Experiments establish a zero-shot baseline of 42% accuracy for current MLLMs, which struggle with spatial symmetry and out-of-view ambiguity, and show that the sensory-bounded reasoning chain outperforms purely egocentric and allocentric baselines.

📝 Abstract

While Multi-Modal Large Language Models (MLLMs) demonstrate impressive capabilities in general reasoning, their embodied spatial intelligence remains hampered by a "Cartesian Illusion" - a reliance on text-based probability distributions that lack grounded, 3D topological understanding. This limitation is starkly exposed in multi-agent environments, which demand more than just scene perception; they require second-order Theory of Mind (ToM). Specifically, an Agent A must be able to infer Agent B's belief about the environment, governed strictly by Agent B's physical orientation and sensory limitations. In this paper, we probe the limits of two-stage spatial inference in MLLMs through a novel audio-visual task: requiring Agent A to predict Agent B's estimation of A's relative location. To solve this, we propose an Epistemic Sensory Bottleneck module that abandons rigid, rule-based coordinate transformations. Instead, we introduce an Anchor-Based Embodied Spatial Decomposition Chain-of-Thought (CoT). This guides the MLLM through a "geometric-to-semantic" projection, forcing it to first establish B's local coordinate system and then dynamically weight visual and auditory modalities based on whether A falls within B's visual frustum. Extensive evaluations reveal that while current MLLMs fundamentally struggle with spatial symmetry and out-of-view ambiguities (establishing a rigorous zero-shot baseline of 42% accuracy), our sensory-bounded reasoning chain robustly outperforms pure egocentric and allocentric baselines. By systematically benchmarking these perceptual bottlenecks, our work exposes the current limits of MLLM spatial reasoning and establishes a foundational paradigm for epistemic, modality-aware inference in Embodied AI.
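
To make the task geometry concrete, the following is a minimal 2D sketch of the quantities the reasoning chain has to recover: A's position in B's local frame, whether A lies inside B's visual frustum, and how that gates the modalities. It is not the authors' method, which is a prompting chain that deliberately avoids rule-based coordinate transformations; it only illustrates the ground-truth relation a model's answer would be compared against. The function names, the 90-degree field of view, the four direction labels, and the specific weight values are all assumptions made for illustration.

```python
import math


def to_local_frame(a_xy, b_xy, b_heading_deg):
    """Express A's world position in B's local frame.

    Assumed convention: heading is measured counter-clockwise from the
    world +x axis; local axes are (forward, left).
    """
    dx, dy = a_xy[0] - b_xy[0], a_xy[1] - b_xy[1]
    h = math.radians(b_heading_deg)
    forward = dx * math.cos(h) + dy * math.sin(h)
    left = -dx * math.sin(h) + dy * math.cos(h)
    return forward, left


def bearing_deg(forward, left):
    """Signed angle of A relative to B's forward axis (positive = left)."""
    return math.degrees(math.atan2(left, forward))


def in_visual_frustum(forward, left, fov_deg=90.0):
    """True if A falls inside B's horizontal field of view (assumed 90 degrees)."""
    return abs(bearing_deg(forward, left)) <= fov_deg / 2


def modality_weights(in_view):
    """Hypothetical weighting: vision dominates in view, audio alone out of view."""
    if in_view:
        return {"visual": 0.8, "audio": 0.2}
    return {"visual": 0.0, "audio": 1.0}


def relative_label(forward, left):
    """Coarse semantic label for where B would place A (assumed label set)."""
    b = bearing_deg(forward, left)
    if -45 <= b <= 45:
        return "front"
    if 45 < b < 135:
        return "left"
    if -135 < b < -45:
        return "right"
    return "behind"


def b_estimate_of_a(a_xy, b_xy, b_heading_deg, fov_deg=90.0):
    """Combine the steps: local frame, frustum test, modality weights, label."""
    forward, left = to_local_frame(a_xy, b_xy, b_heading_deg)
    visible = in_visual_frustum(forward, left, fov_deg)
    return {
        "label": relative_label(forward, left),
        "in_view": visible,
        "weights": modality_weights(visible),
    }
```

For example, with B at the origin facing along world +y (heading 90), an agent A at (5, 0) is to B's right and outside the assumed frustum, so under this sketch B could only localize A by sound. A front/behind confusion on audio-only cases of this kind is one plausible reading of the "spatial symmetry" failures the abstract mentions, though the paper's exact definition is not given here.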

Problem

Research questions and friction points this paper is trying to address.

Theory of Mind

Multi-Modal Large Language Models

Embodied Spatial Intelligence

Perceptual Bottlenecks

Spatial Reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Epistemic Sensory Bottleneck

Anchor-Based Embodied Spatial Decomposition

Two-Stage Multi-Modal Theory of Mind