EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language models (MLLMs) struggle to effectively model cross-frame spatial relationships in multi-frame spatial reasoning tasks, while approaches relying on 3D priors or geometric supervision incur high data acquisition costs. This work proposes EgoMind, a novel framework that, for the first time, activates MLLMs’ spatial reasoning capabilities through a purely language-based inference mechanism, eliminating the need for 3D data or geometric annotations. EgoMind constructs cross-frame linguistic scene graphs via role-playing image-text descriptions and integrates progressive spatial analysis with a chain-of-thought architecture. Using only 5K automatically synthesized supervised fine-tuning (SFT) samples and 20K reinforcement learning (RL) samples, the method achieves competitive performance on VSI-Bench, SPAR-Bench, SITE-Bench, and SPBench, substantially reducing data preparation costs.
📝 Abstract
Multimodal large language models (MLLMs) are increasingly being applied to spatial cognition tasks, where they are expected to understand and interact with complex environments. Most existing works improve spatial reasoning by introducing 3D priors or geometric supervision, which enhances performance but incurs substantial data preparation and alignment costs. In contrast, purely 2D approaches often struggle with multi-frame spatial reasoning due to their limited ability to capture cross-frame spatial relationships. To address these limitations, we propose EgoMind, a Chain-of-Thought framework that enables geometry-free spatial reasoning through Role-Play Caption, which jointly constructs a coherent linguistic scene graph across frames, and Progressive Spatial Analysis, which progressively reasons toward task-specific questions. With only 5K auto-generated SFT samples and 20K RL samples, EgoMind achieves competitive results on VSI-Bench, SPAR-Bench, SITE-Bench, and SPBench, demonstrating its effectiveness in strengthening the spatial reasoning capabilities of MLLMs and highlighting the potential of linguistic reasoning for spatial cognition. Code and data are released at https://github.com/Hyggge/EgoMind.
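The abstract describes a two-stage, geometry-free pipeline: Role-Play Captions build a cross-frame linguistic scene graph, and Progressive Spatial Analysis reasons over it with chain-of-thought prompting. The sketch below illustrates how such a prompting pipeline could be assembled as plain text; all function names, prompt wording, and the caption format are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of an EgoMind-style two-stage prompting pipeline.
# Function names, prompt text, and data shapes are assumptions for
# illustration; see the paper's released code for the real method.

def role_play_caption_prompt(frame_id: int) -> str:
    """Stage 1: ask the MLLM to caption a frame from the camera-wearer's
    perspective, so all per-frame captions share one egocentric frame."""
    return (
        f"You are the camera wearer. Describe frame {frame_id}: list each "
        "visible object, its rough position relative to you (left/right, "
        "near/far), and note any objects also seen in earlier frames."
    )

def build_scene_graph(captions: dict) -> str:
    """Merge per-frame captions into one linguistic scene graph: a purely
    textual description linking objects across frames, with no 3D data."""
    lines = [f"[frame {i}] {c}" for i, c in sorted(captions.items())]
    return "Cross-frame scene description:\n" + "\n".join(lines)

def progressive_spatial_analysis(scene_graph: str, question: str) -> str:
    """Stage 2: chain-of-thought prompt that reasons step by step from the
    linguistic scene graph toward the task-specific spatial question."""
    return (
        f"{scene_graph}\n\n"
        "Reason step by step: (1) locate the objects the question mentions, "
        "(2) infer their relative positions across frames, (3) answer.\n"
        f"Question: {question}"
    )

# Example: assemble the final prompt for a relative-direction question.
captions = {0: "a chair to my left, a table ahead",
            1: "the table now to my right; a door ahead"}
prompt = progressive_spatial_analysis(
    build_scene_graph(captions),
    "Is the door to the left or right of the chair?")
```

The key design point the sketch mirrors is that spatial state lives entirely in language: cross-frame relationships are carried by the merged captions rather than by 3D priors or geometric supervision.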
Problem

Research questions and friction points this paper is trying to address.

spatial cognition
multimodal large language models
multi-frame spatial reasoning
cross-frame spatial relationships
geometry-free reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought
linguistic reasoning
spatial cognition
geometry-free
multimodal LLMs
Zhenghao Chen
State Key Laboratory of Complex and Critical Software Environment, Beihang University; School of Computer Science and Engineering, Beihang University
Huiqun Wang
State Key Laboratory of Complex and Critical Software Environment, Beihang University; School of Computer Science and Engineering, Beihang University
Di Huang
Computer Science and Engineering, Beihang University
Computer Vision
Representation Learning
Generative AI
Embodied AI