🤖 AI Summary
Existing vision-language models (VLMs) are constrained by a 2D planar perception paradigm that limits their ability to model real-world 3D spatial structure, primarily because geometry-aware encoders are computationally expensive and clash with 2D features at the input, while discrete tokenizers cannot generate precise continuous values at the output. This work proposes GEODE, the first architecture to decouple 3D spatial reasoning from numerical regression. Its Decoupled Rationale Module acts as a spatial co-processor, fusing explicit 3D data with 2D visual features and distilling a spatial chain-of-thought into injectable rationale tokens, while its "Embedding-as-Value" Direct Regression Head routes control tokens to a lightweight MLP for precise, continuous numerical prediction. With only 1.5B parameters, GEODE matches the performance of 7B-parameter models, achieving state-of-the-art results on both 3D spatial understanding and continuous numerical prediction.
📝 Abstract
Existing Vision-Language Models (VLMs), architecturally rooted in "flatland" perception, fundamentally struggle to comprehend real-world 3D spatial intelligence. This failure stems from a dual bottleneck: an input-stage conflict between computationally exorbitant geometry-aware encoders and superficial 2D-only features, and an output-stage misalignment where discrete tokenizers are structurally incapable of producing precise, continuous numerical values. To break this impasse, we introduce GEODE (Geometric-Output and Decoupled-Input Engine), a novel architecture that resolves this dual bottleneck by decoupling 3D reasoning from numerical generation. GEODE augments the main VLM with two specialized, plug-and-play modules: a Decoupled Rationale Module (DRM) that acts as a spatial co-processor, aligning explicit 3D data with 2D visual features via cross-attention and distilling spatial Chain-of-Thought (CoT) logic into injectable Rationale Tokens; and a Direct Regression Head (DRH), an "Embedding-as-Value" paradigm that routes specialized control tokens to a lightweight MLP for precise, continuous regression of scalars and 3D bounding boxes. The synergy of these modules allows our 1.5B-parameter model to function as a high-level semantic dispatcher, achieving state-of-the-art spatial reasoning performance that rivals 7B+ models.
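The "Embedding-as-Value" idea in the abstract can be sketched as follows: instead of decoding numbers token by token, the hidden embedding of a special control token is handed to a lightweight MLP that regresses continuous values directly. This is a minimal illustrative sketch, not the paper's implementation; all dimensions, weights, and the control-token name are hypothetical, and NumPy stands in for the actual deep-learning stack.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN_DIM = 64  # hypothetical VLM hidden size
BBOX_DIM = 6     # e.g. a 3D box parameterized as (x, y, z, w, h, d)

class DirectRegressionHead:
    """Lightweight two-layer MLP mapping a control-token embedding to values.

    Sketch of the DRH described in the abstract: the VLM stays a discrete
    semantic dispatcher, while this head produces the continuous numbers.
    """
    def __init__(self, hidden_dim, out_dim):
        # Small random weights; a trained model would learn these.
        self.w1 = rng.standard_normal((hidden_dim, hidden_dim)) * 0.02
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.standard_normal((hidden_dim, out_dim)) * 0.02
        self.b2 = np.zeros(out_dim)

    def __call__(self, h):
        # ReLU keeps the sketch dependency-light; the activation is illustrative.
        z = np.maximum(h @ self.w1 + self.b1, 0.0)
        return z @ self.w2 + self.b2

# When the VLM emits a control token such as "<BBOX_3D>" (name hypothetical),
# that token's final hidden state is routed to the matching head.
scalar_head = DirectRegressionHead(HIDDEN_DIM, 1)
bbox_head = DirectRegressionHead(HIDDEN_DIM, BBOX_DIM)

control_embedding = rng.standard_normal(HIDDEN_DIM)  # stand-in hidden state

distance = scalar_head(control_embedding)  # continuous scalar, not a token
bbox = bbox_head(control_embedding)        # continuous 3D-box parameters
print(distance.shape, bbox.shape)
```

The design point the sketch illustrates is that regression bypasses the tokenizer entirely, so the output's numerical precision is limited only by the MLP, not by a discrete vocabulary.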