COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence

📅 2025-12-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current multimodal large language models (MLLMs) exhibit significant limitations in 3D spatial perception and reasoning, with perception enhancement and reasoning modeling typically treated in isolation. This work introduces the first unified framework that deeply integrates auxiliary modality generation—specifically depth and segmentation maps—with an adaptive interleaved reasoning mechanism, enabling models to internalize 3D spatial knowledge. A two-stage joint training strategy simultaneously optimizes both auxiliary modality generation and spatial reasoning capabilities. Experiments demonstrate an average 6.91% improvement on spatial reasoning benchmarks; a variant employing only auxiliary modality generation achieves a 7.92% gain in distance and size estimation without compromising general multimodal understanding performance. The core contribution lies in establishing the first perception–reasoning co-modeling paradigm, offering a novel foundation for embodied spatial intelligence.

Technology Category

Application Category

📝 Abstract
Visual Spatial Reasoning is crucial for enabling Multimodal Large Language Models (MLLMs) to understand object properties and spatial relationships, yet current models still struggle with 3D-aware reasoning. Existing approaches typically enhance either perception, by augmenting RGB inputs with auxiliary modalities such as depth and segmentation, or reasoning, by training on spatial VQA datasets and applying reinforcement learning, and thus treat these two aspects in isolation. In this work, we investigate whether a unified MLLM can develop an intrinsic ability to enhance spatial perception and, through adaptive interleaved reasoning, achieve stronger spatial intelligence. We propose extbf{COOPER}, a unified MLLM that leverages depth and segmentation as auxiliary modalities and is trained in two stages to acquire auxiliary modality generation and adaptive, interleaved reasoning capabilities. COOPER achieves an average extbf{6.91%} improvement in spatial reasoning while maintaining general performance. Moreover, even a variant trained only for auxiliary modality generation attains a extbf{7.92%} gain on distance and size estimation, suggesting that learning to generate auxiliary modalities helps internalize spatial knowledge and strengthen spatial understanding.
Problem

Research questions and friction points this paper is trying to address.

Enhance 3D-aware spatial reasoning in MLLMs
Unify perception and reasoning in spatial intelligence
Improve spatial understanding via auxiliary modality generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified MLLM integrates depth and segmentation modalities
Two-stage training for auxiliary modality generation and reasoning
Adaptive interleaved reasoning enhances spatial perception and understanding
🔎 Similar Papers
No similar papers found.