Beyond Flatlands: Unlocking Spatial Intelligence by Decoupling 3D Reasoning from Numerical Regression

📅 2025-11-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) are constrained by a 2D planar perception paradigm, which limits their ability to model real-world 3D spatial structure. Two bottlenecks are responsible: at the input, computationally expensive geometry-aware encoders conflict with 2D-only features; at the output, discrete tokenizers cannot produce precise continuous values. This work proposes GEODE, a novel architecture that decouples 3D spatial reasoning from numerical regression. A Decoupled Rationale Module (DRM) acts as a spatial co-processor: it aligns explicit 3D data with 2D visual features via cross-attention and distills a spatial chain-of-thought into injectable Rationale Tokens. A Direct Regression Head (DRH) implements an “embedding-as-value” paradigm, routing control tokens to a lightweight MLP for continuous prediction of scalars and 3D bounding boxes. With only 1.5B parameters, GEODE is reported to achieve state-of-the-art spatial reasoning performance that rivals 7B+ models.

📝 Abstract
Existing Vision Language Models (VLMs), architecturally rooted in "flatland" perception, fundamentally struggle with real-world 3D spatial intelligence. This failure stems from a dual bottleneck: an input-stage conflict between computationally exorbitant geometry-aware encoders and superficial 2D-only features, and an output-stage misalignment in which discrete tokenizers are structurally incapable of producing precise, continuous numerical values. To break this impasse, we introduce GEODE (Geometric-Output and Decoupled-Input Engine), a novel architecture that resolves this dual bottleneck by decoupling 3D reasoning from numerical generation. GEODE augments the main VLM with two specialized, plug-and-play modules: a Decoupled Rationale Module (DRM), which acts as a spatial co-processor, aligning explicit 3D data with 2D visual features via cross-attention and distilling spatial Chain-of-Thought (CoT) logic into injectable Rationale Tokens; and a Direct Regression Head (DRH), an "Embedding-as-Value" paradigm that routes specialized control tokens to a lightweight MLP for precise, continuous regression of scalars and 3D bounding boxes. The synergy of these modules allows our 1.5B-parameter model to function as a high-level semantic dispatcher, achieving state-of-the-art spatial reasoning performance that rivals 7B+ models.
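
The cross-attention alignment attributed to the DRM can be illustrated with a minimal single-head sketch in plain Python. This is not the authors' implementation: the function names, the toy dimensions, the choice of 2D visual features as queries with 3D geometry features as keys and values, and the omission of learned projection matrices are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries_2d, keys_3d, values_3d):
    """Single-head scaled dot-product cross-attention.

    queries_2d: 2D visual feature vectors (assumed to be the query side).
    keys_3d / values_3d: 3D geometry feature vectors (assumed key/value side).
    Returns one fused vector per query, each a weighted mix of values_3d.
    """
    scale = 1.0 / math.sqrt(len(keys_3d[0]))
    out_dim = len(values_3d[0])
    fused = []
    for query in queries_2d:
        weights = softmax([dot(query, key) * scale for key in keys_3d])
        fused.append([
            sum(w * value[j] for w, value in zip(weights, values_3d))
            for j in range(out_dim)
        ])
    return fused

# Toy inputs: 2 visual tokens attend over 3 geometry tokens.
visual_tokens = [[1.0, 0.0], [0.0, 1.0]]
geometry_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
geometry_values = [[0.5, 0.1, 2.0], [0.2, 0.9, 1.0], [0.4, 0.4, 3.0]]

fused_tokens = cross_attention(visual_tokens, geometry_keys, geometry_values)
```

Each fused vector is a convex combination of the geometry values, so the visual token picks up 3D information in proportion to its similarity with each geometry key. How the resulting features are distilled into Rationale Tokens is not specified in the abstract and is left out of the sketch.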

Problem

Research questions and friction points this paper is trying to address.

VLMs struggle with 3D spatial reasoning due to architectural limitations
Existing models face input conflicts between 2D features and 3D geometry
Discrete tokenizers cannot generate precise continuous numerical outputs
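
One way to see the output-side mismatch is that token-level agreement does not track numeric distance. The sketch below is a toy illustration, not taken from the paper: it assumes a character-level tokenization of fixed-precision decimals (real VLM tokenizers use subword vocabularies), and the helper names are hypothetical.

```python
def char_tokens(value, decimals=2):
    """Render a number at fixed precision and split it into character tokens."""
    return list(f"{value:.{decimals}f}")

def token_mismatches(prediction, target):
    """Count positions where the token sequences disagree."""
    pred, gold = char_tokens(prediction), char_tokens(target)
    differing = sum(p != g for p, g in zip(pred, gold))
    return differing + abs(len(pred) - len(gold))

target = 2.00
candidates = [1.99, 2.01, 9.00]

# Pair each candidate with its token-level and numeric error.
report = [
    (value, token_mismatches(value, target), abs(value - target))
    for value in candidates
]
```

Here `1.99` disagrees with `2.00` on three tokens while `9.00` disagrees on only one, even though `9.00` is far worse numerically. A regression objective on the value itself would rank these predictions by their actual distance from the target.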

Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples 3D reasoning from numerical generation
Uses a plug-and-play spatial co-processor module (DRM)
Employs an Embedding-as-Value paradigm (DRH) for continuous regression
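
The Embedding-as-Value routing can be sketched as follows: when the language model emits a control token, that token's hidden state is passed to a small MLP that outputs continuous values, instead of being decoded as text. This is a minimal sketch under stated assumptions, not the paper's implementation. The control token names `<SCALAR>` and `<BOX3D>`, the 7-value box parameterization, the layer sizes, and the random untrained weights are all hypothetical.

```python
import random

# Hypothetical control tokens mapped to output sizes (assumptions).
# 7 values for a box = center (3) + size (3) + yaw (1), an assumed layout.
HEAD_DIMS = {"<SCALAR>": 1, "<BOX3D>": 7}
EMBED_DIM = 4

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def make_mlp(in_dim, hidden_dim, out_dim, rng):
    """Two-layer MLP with random (untrained) weights and zero biases."""
    def layer(n_out, n_in):
        weight = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                  for _ in range(n_out)]
        return weight, [0.0] * n_out
    return layer(hidden_dim, in_dim), layer(out_dim, hidden_dim)

def mlp_forward(mlp, x):
    (w1, b1), (w2, b2) = mlp
    hidden = [max(0.0, dot(row, x) + b) for row, b in zip(w1, b1)]  # ReLU
    return [dot(row, hidden) + b for row, b in zip(w2, b2)]

def decode(tokens, hidden_states, heads):
    """Route control-token hidden states to a regression head.

    Ordinary tokens pass through as text; control tokens are replaced
    by the continuous values regressed from their hidden state.
    """
    outputs = []
    for token, state in zip(tokens, hidden_states):
        if token in heads:
            outputs.append(mlp_forward(heads[token], state))
        else:
            outputs.append(token)
    return outputs

rng = random.Random(0)
heads = {tok: make_mlp(EMBED_DIM, 8, dim, rng) for tok, dim in HEAD_DIMS.items()}

# Stand-in for a VLM output sequence and its per-token hidden states.
tokens = ["The", "chair", "is", "<SCALAR>", "m", "away", "at", "<BOX3D>"]
hidden_states = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in tokens]

decoded = decode(tokens, hidden_states, heads)
```

In this arrangement the language model only decides where a number belongs, while the head decides what the number is, which matches the abstract's description of the VLM as a high-level semantic dispatcher.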