Abstract 3D Perception for Spatial Intelligence in Vision-Language Models

📅 2025-11-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) exhibit limited 3D spatial reasoning and physical understanding because their training data is largely 2D image-based, hindering deployment in robotics and embodied AI. To address this, we propose SandboxVLM, a zero-shot framework that enhances 3D capabilities without additional training. It explicitly encodes geometric structure and physical motion properties via abstract bounding-box representations, and integrates multi-view prior generation with abstract control, proxy elevation, multi-view voting and clustering, and 3D-aware reasoning into an end-to-end 3D sandbox reconstruction and perception pipeline. Evaluated across multiple benchmarks and diverse backbone VLMs under zero-shot settings, SandboxVLM consistently improves spatial reasoning; on SAT Real, it raises accuracy by 8.3% over strong baselines.

📝 Abstract
Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding, which are crucial for real-world applications like robotics and embodied agents. We attribute this to a modality gap between 3D tasks and the 2D training of VLMs, which leads to inefficient retrieval of 3D information from 2D input. To bridge this gap, we introduce SandboxVLM, a simple yet effective framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Specifically, we design a 3D Sandbox reconstruction and perception pipeline comprising four stages: generating multi-view priors with abstract control, proxy elevation, multi-view voting and clustering, and 3D-aware reasoning. Evaluated in zero-shot settings across multiple benchmarks and VLM backbones, our approach consistently improves spatial intelligence, achieving, for instance, an 8.3% gain on SAT Real over baseline methods. These results demonstrate that equipping VLMs with a 3D abstraction substantially enhances their 3D reasoning ability without additional training, suggesting new possibilities for general-purpose embodied intelligence.
Problem

Research questions and friction points this paper is trying to address.

VLMs struggle with 3D spatial cognition and physical understanding
There is a modality gap between 3D tasks and 2D VLM training
Current methods inefficiently retrieve 3D information from 2D input
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages abstract bounding boxes for geometric structure encoding
Uses multi-view voting and clustering for 3D reconstruction
Enables 3D-aware reasoning without additional training
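The multi-view voting and clustering stage can be pictured as merging per-view 3D box proposals that agree across enough views. The sketch below is illustrative only: the function names, distance threshold, and vote count are assumptions, not details from the paper.

```python
from statistics import mean

def cluster_votes(proposals, radius=0.2, min_votes=2):
    """Greedily cluster per-view 3D center proposals.

    proposals: list of (view_id, (x, y, z)) tuples.
    Returns consensus centers supported by >= min_votes distinct views.
    """
    clusters = []  # each cluster: {"centers": [...], "views": set()}
    for view_id, c in proposals:
        placed = False
        for cl in clusters:
            ref = cl["centers"][0]
            # Euclidean distance to the cluster's first proposal
            if sum((a - b) ** 2 for a, b in zip(c, ref)) ** 0.5 <= radius:
                cl["centers"].append(c)
                cl["views"].add(view_id)
                placed = True
                break
        if not placed:
            clusters.append({"centers": [c], "views": {view_id}})
    # Keep clusters seen from enough views; average their centers.
    return [
        tuple(mean(axis) for axis in zip(*cl["centers"]))
        for cl in clusters
        if len(cl["views"]) >= min_votes
    ]

votes = [
    (0, (1.00, 0.50, 0.30)),  # view 0 proposes an object center here
    (1, (1.05, 0.48, 0.31)),  # view 1 agrees (within radius)
    (2, (3.00, 2.00, 0.00)),  # spurious single-view detection
]
print(cluster_votes(votes))  # one consensus center; the outlier is dropped
```

Proposals backed by only one view are discarded as noise, which is the intuition behind voting: geometric structure that is consistent across views survives into the 3D sandbox.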