🤖 AI Summary
This work investigates whether multimodal large language models possess egocentric 3D spatial proximity reasoning capabilities essential for perception and action in embodied intelligence. To this end, we introduce EgoProx, the first cognitive-tiered benchmark specifically designed for embodied 3D proximity reasoning, along with a scalable agent-based simulation engine that generates diverse and consistent question-answer pairs at scale. Experimental results reveal that while current models exhibit some inherent spatial knowledge, they struggle to effectively leverage it in visual question answering tasks. However, instruction fine-tuning substantially enhances their spatial reasoning performance across both tasks and domains.
📝 Abstract
Humans constantly reason about 3D proximity, the relations between their body and surrounding objects, to guide perception and action in daily life. Whether multimodal large language models (MLLMs) can perform such embodied 3D reasoning remains unclear. To investigate this, we introduce EgoProx, a benchmark for egocentric 3D proximity reasoning. We organize our tasks along a cognitive chain, covering intention, exploration, exploitation, and chain-of-actions reasoning. We also design an agent-based data engine that produces diverse and consistent QA pairs at scale. We benchmark prevailing MLLMs on EgoProx and conduct additional analyses with dataset-specific and task-specific instruction tuning. Instruction tuning yields large cross-domain gains, indicating that current MLLMs contain some spatial knowledge; however, they still struggle to leverage it effectively for spatial-reasoning VQA.