🤖 AI Summary
Current multimodal large language models (MLLMs) exhibit significant limitations in 3D spatial perception and reasoning, with perception enhancement and reasoning modeling typically treated in isolation. This work introduces the first unified framework that deeply integrates auxiliary modality generation—specifically depth and segmentation maps—with an adaptive interleaved reasoning mechanism, enabling models to internalize 3D spatial knowledge. A two-stage joint training strategy simultaneously optimizes both auxiliary modality generation and spatial reasoning capabilities. Experiments demonstrate an average 6.91% improvement on spatial reasoning benchmarks; a variant employing only auxiliary modality generation achieves a 7.92% gain in distance and size estimation without compromising general multimodal understanding performance. The core contribution lies in establishing the first perception–reasoning co-modeling paradigm, offering a novel foundation for embodied spatial intelligence.
📝 Abstract
Visual Spatial Reasoning is crucial for enabling Multimodal Large Language Models (MLLMs) to understand object properties and spatial relationships, yet current models still struggle with 3D-aware reasoning. Existing approaches typically enhance either perception, by augmenting RGB inputs with auxiliary modalities such as depth and segmentation, or reasoning, by training on spatial VQA datasets and applying reinforcement learning, and thus treat these two aspects in isolation. In this work, we investigate whether a unified MLLM can develop an intrinsic ability to enhance spatial perception and, through adaptive interleaved reasoning, achieve stronger spatial intelligence. We propose **COOPER**, a unified MLLM that leverages depth and segmentation as auxiliary modalities and is trained in two stages to acquire auxiliary modality generation and adaptive, interleaved reasoning capabilities. COOPER achieves an average **6.91%** improvement in spatial reasoning while maintaining general performance. Moreover, even a variant trained only for auxiliary modality generation attains a **7.92%** gain on distance and size estimation, suggesting that learning to generate auxiliary modalities helps internalize spatial knowledge and strengthen spatial understanding.
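To make the idea of adaptive interleaved reasoning concrete, the following is a minimal, purely illustrative Python sketch. All function names and the keyword-based policy are hypothetical stand-ins: in COOPER the decision of when to generate a depth or segmentation map, and the maps themselves, would be produced by the unified MLLM rather than by hand-written rules.

```python
def generate_auxiliary(image, modality):
    # Stub: in the actual model, depth/segmentation tokens would be decoded
    # by the MLLM itself; here we just return a placeholder string.
    return f"{modality}_map({image})"


def needs_modality(question, modality):
    # Hypothetical adaptive policy: choose auxiliary modalities by keyword.
    # A trained model would learn this decision end-to-end.
    cues = {
        "depth": ("far", "distance", "size", "close"),
        "segmentation": ("which", "object", "count", "how many"),
    }
    return any(word in question.lower() for word in cues[modality])


def interleaved_reasoning(image, question):
    """Alternate auxiliary-modality generation with textual reasoning steps."""
    trace = [question]
    for modality in ("depth", "segmentation"):
        if needs_modality(question, modality):
            aux = generate_auxiliary(image, modality)
            trace.append(aux)
            trace.append(f"reasoning step grounded in {aux}")
    trace.append("final answer")
    return trace


steps = interleaved_reasoning("img.png", "How far is the chair?")
```

For a distance question, the stub policy inserts a depth map (but no segmentation map) into the reasoning trace before the final answer, mirroring the interleaving of perception and reasoning described above.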