🤖 AI Summary
Vision-language models (VLMs) exhibit limited 3D spatial reasoning and physical understanding because they are trained largely on 2D image data, which hinders their deployment in robotics and embodied AI. To address this, we propose SandboxVLM, a zero-shot framework that enhances 3D capability without additional training. It explicitly encodes geometric structure and physical motion properties via abstract bounding-box representations, and integrates multi-view prior generation, proxy elevation, voting-based clustering, and 3D-aware reasoning into an end-to-end 3D sandbox reconstruction and perception pipeline that unifies multi-view geometric analysis with abstract control. Evaluated in zero-shot settings across multiple benchmarks and diverse backbone VLMs, SandboxVLM consistently improves spatial reasoning; on SAT Real, for example, it raises accuracy by 8.3% over strong baselines.
📝 Abstract
Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding, which are crucial for real-world applications like robotics and embodied agents. We attribute this to a modality gap between 3D tasks and the 2D training of VLMs, which leads to inefficient retrieval of 3D information from 2D input. To bridge this gap, we introduce SandboxVLM, a simple yet effective framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Specifically, we design a 3D sandbox reconstruction and perception pipeline comprising four stages: multi-view prior generation with abstract control, proxy elevation, multi-view voting and clustering, and 3D-aware reasoning. Evaluated in zero-shot settings across multiple benchmarks and VLM backbones, our approach consistently improves spatial intelligence, achieving, for instance, an 8.3% gain on SAT Real over baseline methods. These results demonstrate that equipping VLMs with a 3D abstraction substantially enhances their 3D reasoning ability without additional training, suggesting new possibilities for general-purpose embodied intelligence.