AI Summary
This work addresses the limited spatial reasoning of vision-language models (VLMs): although they can generate 3D primitive code that captures object categories, counts, and coarse locations, they often fail at spatial questions about the same image. To bridge this gap, the authors use executable 3D geometric primitive code as an intermediate representation for spatial reasoning and make three contributions: the SpatialBabel benchmark, which evaluates how the choice of scene-code language affects VLM performance; Code-CoT, a training-free reasoning strategy; and S³-FT, a self-supervised fine-tuning method that distills knowledge from model-generated primitives. Experiments show that Code-CoT improves performance by up to 6.4% on SpatialBabel-QA and by 5.0% on CV-Bench-3D, while S³-FT improves Qwen3-VL-8B by 4.6% to 17% across multiple benchmarks, and the approach transfers across model families.
Abstract
Vision-language models (VLMs) exhibit a striking paradox: they can generate executable code that reconstructs a 3D scene from geometric primitives with correct object counts, classes, and approximate positions, yet the same models fail at simpler spatial questions about the same image. We show that 3D geometric primitives (cubes, spheres, cylinders, expressed in executable code) serve as a powerful intermediate representation for spatial understanding, and exploit this through three contributions. First, we introduce SpatialBabel, a benchmark evaluating fourteen VLMs on primitive-based 3D scene reconstruction across six scene-code languages (programming languages and declarative formats for 3D primitive scenes), revealing that a single model's object-detection F1 can vary by up to 5.7× across languages. Second, we propose Code-CoT (Code Chain-of-Thought), a training-free inference strategy that routes spatial reasoning through primitive-based code generation. Code-CoT lifts the SpatialBabel-QA-Score by up to +6.4% on primitive scenes and real-photo CV-Bench-3D accuracy by +5.0% for VLMs with strong coding capabilities. Third, we propose S³-FT (Self-Supervised Spatial Fine-Tuning), which distills primitive spatial knowledge into general visual reasoning by parsing the model's own Three.js primitive reconstructions into structured annotations and fine-tuning on the result, with no human labels and no teacher model. Training on primitive images alone, S³-FT improves Qwen3-VL-8B by +4.6% to +8.6% on SpatialBabel-Primitive-QA, +9.7% on CV-Bench-2D, and +17% on HallusionBench; the recipe transfers across model families. These results establish geometric primitives in code as both a diagnostic and a transferable spatial vocabulary for VLMs. We will release all artifacts upon publication.
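To make the idea of parsing Three.js primitive reconstructions into structured annotations more concrete, the following is a minimal Python sketch under stated assumptions, not the authors' pipeline. The scene snippet, the variable names in it, the regular expressions, the annotation fields (`name`, `class`, `position`), and the `left_of` convention are all hypothetical; only the Three.js constructors `THREE.Mesh`, `THREE.BoxGeometry`, `THREE.SphereGeometry`, `THREE.CylinderGeometry` and the `position.set` call are real library names.

```python
import re

# Hypothetical model-generated Three.js scene code (illustrative, not from the paper).
SCENE_CODE = """
const cube = new THREE.Mesh(new THREE.BoxGeometry(1, 1, 1), redMaterial);
cube.position.set(-2, 0.5, 0);
scene.add(cube);
const ball = new THREE.Mesh(new THREE.SphereGeometry(0.5), blueMaterial);
ball.position.set(1.5, 0.5, -1);
scene.add(ball);
"""

# Map Three.js geometry constructors to the primitive classes named in the abstract.
GEOMETRY_TO_CLASS = {
    "BoxGeometry": "cube",
    "SphereGeometry": "sphere",
    "CylinderGeometry": "cylinder",
}

# Assumed code style: `const <name> = new THREE.Mesh(new THREE.<Geometry>(...`.
MESH_RE = re.compile(r"const\s+(\w+)\s*=\s*new\s+THREE\.Mesh\(\s*new\s+THREE\.(\w+)\(")
# Assumed code style: `<name>.position.set(x, y, z)` with numeric literals.
POSITION_RE = re.compile(r"(\w+)\.position\.set\(\s*([^)]*)\)")


def parse_scene(code: str) -> list[dict]:
    """Extract one annotation (name, class, position) per recognized primitive."""
    positions = {}
    for name, args in POSITION_RE.findall(code):
        positions[name] = tuple(float(a) for a in args.split(","))

    objects = []
    for name, geometry in MESH_RE.findall(code):
        if geometry in GEOMETRY_TO_CLASS:
            objects.append({
                "name": name,
                "class": GEOMETRY_TO_CLASS[geometry],
                # Three.js places objects at the origin when no position is set.
                "position": positions.get(name, (0.0, 0.0, 0.0)),
            })
    return objects


def left_of(a: dict, b: dict) -> bool:
    """Assumed convention: a smaller world-space x coordinate means further left."""
    return a["position"][0] < b["position"][0]


annotations = parse_scene(SCENE_CODE)

# Derive a simple count annotation per primitive class.
counts = {}
for obj in annotations:
    counts[obj["class"]] = counts.get(obj["class"], 0) + 1
```

Annotations of this kind (classes, counts, coordinates, pairwise relations) could then be turned into question-answer pairs for fine-tuning. How the paper actually builds its training data is not specified in the abstract, and a real parser would also need to account for the camera viewpoint before deciding what "left" means in the rendered image.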