EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language models (MLLMs) struggle to effectively model cross-frame spatial relationships in multi-frame spatial reasoning tasks, while approaches relying on 3D priors or geometric supervision incur high data acquisition costs. This work proposes EgoMind, a novel framework that, for the first time, activates MLLMs’ spatial reasoning capabilities through a purely language-based inference mechanism, eliminating the need for 3D data or geometric annotations. EgoMind constructs cross-frame linguistic scene graphs via role-playing image-text descriptions and integrates progressive spatial analysis with a chain-of-thought architecture. Using only 5K automatically synthesized supervised fine-tuning (SFT) samples and 20K reinforcement learning (RL) samples, the method achieves competitive performance on VSI-Bench, SPAR-Bench, SITE-Bench, and SPBench, substantially reducing data preparation costs.
📝 Abstract
Multimodal large language models (MLLMs) are increasingly being applied to spatial cognition tasks, where they are expected to understand and interact with complex environments. Most existing works improve spatial reasoning by introducing 3D priors or geometric supervision, which enhances performance but incurs substantial data preparation and alignment costs. In contrast, purely 2D approaches often struggle with multi-frame spatial reasoning due to their limited ability to capture cross-frame spatial relationships. To address these limitations, we propose EgoMind, a Chain-of-Thought framework that enables geometry-free spatial reasoning through Role-Play Caption, which jointly constructs a coherent linguistic scene graph across frames, and Progressive Spatial Analysis, which progressively reasons toward task-specific questions. With only 5K auto-generated SFT samples and 20K RL samples, EgoMind achieves competitive results on VSI-Bench, SPAR-Bench, SITE-Bench, and SPBench, demonstrating its effectiveness in strengthening the spatial reasoning capabilities of MLLMs and highlighting the potential of linguistic reasoning for spatial cognition. Code and data are released at https://github.com/Hyggge/EgoMind.
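The abstract describes a two-stage, geometry-free pipeline: Role-Play Captions build a cross-frame linguistic scene graph, and Progressive Spatial Analysis reasons over it with chain-of-thought prompting. The sketch below illustrates how such a prompting pipeline could be assembled as plain text; all function names, prompt wording, and the caption format are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of an EgoMind-style two-stage prompting pipeline.
# Function names, prompt text, and data shapes are assumptions for
# illustration; see the paper's released code for the real method.

def role_play_caption_prompt(frame_id: int) -> str:
    """Stage 1: ask the MLLM to caption a frame from the camera-wearer's
    perspective, so all per-frame captions share one egocentric frame."""
    return (
        f"You are the camera wearer. Describe frame {frame_id}: list each "
        "visible object, its rough position relative to you (left/right, "
        "near/far), and note any objects also seen in earlier frames."
    )

def build_scene_graph(captions: dict) -> str:
    """Merge per-frame captions into one linguistic scene graph: a purely
    textual description linking objects across frames, with no 3D data."""
    lines = [f"[frame {i}] {c}" for i, c in sorted(captions.items())]
    return "Cross-frame scene description:\n" + "\n".join(lines)

def progressive_spatial_analysis(scene_graph: str, question: str) -> str:
    """Stage 2: chain-of-thought prompt that reasons step by step from the
    linguistic scene graph toward the task-specific spatial question."""
    return (
        f"{scene_graph}\n\n"
        "Reason step by step: (1) locate the objects the question mentions, "
        "(2) infer their relative positions across frames, (3) answer.\n"
        f"Question: {question}"
    )

# Example: assemble the final prompt for a relative-direction question.
captions = {0: "a chair to my left, a table ahead",
            1: "the table now to my right; a door ahead"}
prompt = progressive_spatial_analysis(
    build_scene_graph(captions),
    "Is the door to the left or right of the chair?")
```

The key design point the sketch mirrors is that spatial state lives entirely in language: cross-frame relationships are carried by the merged captions rather than by 3D priors or geometric supervision.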
Problem

Research questions and friction points this paper is trying to address.

spatial cognition
multimodal large language models
multi-frame spatial reasoning
cross-frame spatial relationships
geometry-free reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought
linguistic reasoning
spatial cognition
geometry-free
multimodal LLMs
Zhenghao Chen
State Key Laboratory of Complex and Critical Software Environment, Beihang University; School of Computer Science and Engineering, Beihang University
Huiqun Wang
State Key Laboratory of Complex and Critical Software Environment, Beihang University; School of Computer Science and Engineering, Beihang University
Di Huang
Computer Science and Engineering, Beihang University
Computer Vision
Representation Learning
Generative AI
Embodied AI