Thinking with Spatial Code for Physical-World Video Reasoning

2026-03-05
Citations: 0
Influential: 0
AI Summary
This work addresses the lack of explicit, temporally consistent 3D spatial representations in video-based visual question answering grounded in the physical world. To this end, the authors propose a spatial encoding framework that transforms RGB videos into explicit 3D spatial representations by jointly performing 6D object pose estimation, multi-object tracking, and geometric prediction within a unified spatial encoder, marking the first integration of these three tasks. Building upon this representation, they introduce a spatial scoring reward mechanism to fine-tune large language models via reinforcement learning, enabling perspective-aware and geometry-grounded reasoning based on explicit 3D bounding boxes and semantic labels. The proposed method achieves state-of-the-art performance on the VSI-Bench benchmark, outperforming existing closed-source vision-language models.
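
To make the idea of an explicit 3D representation concrete, the following is a minimal sketch of what a per-object record and its text serialization could look like. The field names, the `obj<id>` line format, the metre units, and the single-yaw orientation are all assumptions made for illustration; the summary does not specify the actual schema the authors use.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SpatialObject:
    """One tracked object. Hypothetical fields, not the paper's schema."""
    track_id: int                        # identity kept stable across frames
    label: str                           # semantic label, e.g. "sofa"
    center: Tuple[float, float, float]   # box center, assumed metres, world frame
    size: Tuple[float, float, float]     # box extents along its own axes
    yaw: float                           # rotation about the vertical axis, radians


def to_spatial_code(objects: List[SpatialObject]) -> str:
    """Serialize objects into plain-text lines that a language model could read."""
    lines = []
    for o in objects:
        cx, cy, cz = o.center
        sx, sy, sz = o.size
        lines.append(
            f"obj{o.track_id} {o.label} "
            f"center=({cx:.2f},{cy:.2f},{cz:.2f}) "
            f"size=({sx:.2f},{sy:.2f},{sz:.2f}) "
            f"yaw={math.degrees(o.yaw):.0f}deg"
        )
    return "\n".join(lines)


def center_distance(a: SpatialObject, b: SpatialObject) -> float:
    """Euclidean distance between two box centers."""
    return math.dist(a.center, b.center)


# Made-up example scene, for illustration only.
sofa = SpatialObject(0, "sofa", (0.0, 0.0, 0.4), (2.0, 0.9, 0.8), 0.0)
table = SpatialObject(1, "table", (3.0, 4.0, 0.4), (1.2, 0.6, 0.8), math.pi / 2)
print(to_spatial_code([sofa, table]))
print(center_distance(sofa, table))
```

With a representation of this kind, a question such as "how far is the table from the sofa" reduces to arithmetic over named variables rather than estimation from pixels.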

๐Ÿ“ Abstract
We introduce Thinking with Spatial Code, a framework that transforms RGB video into explicit, temporally coherent 3D representations for physical-world visual question answering. We highlight the empirical finding that our proposed spatial encoder can parse videos into structured spatial code with explicit 3D oriented bounding boxes and semantic labels, enabling large language models (LLMs) to reason directly over explicit spatial variables. Specifically, we propose a spatial encoder that encodes image and geometric features by unifying 6D object parsing and tracking backbones with geometric prediction, and we further fine-tune LLMs with reinforcement learning using a spatial rubric reward that encourages perspective-aware, geometrically grounded inference. As a result, our model outperforms proprietary vision-language models on VSI-Bench, setting a new state of the art. Code is available at https://github.com/Beckschen/spatialcode.
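
As an illustration of what perspective-aware inference over explicit spatial variables can involve, the sketch below answers a question of the form "standing at A and facing B, is C to my left or right, in front or behind". It is not the authors' method. The function name, the 2D ground-plane simplification, and the right-handed, z-up coordinate convention are assumptions made for this example.

```python
from typing import Tuple

Point2D = Tuple[float, float]  # (x, y) on the ground plane; z is assumed to point up


def relative_direction(observer: Point2D, facing: Point2D, query: Point2D) -> Tuple[str, str]:
    """Return (side, depth) of `query` as seen from `observer` looking toward `facing`.

    Hypothetical helper. Points lying exactly on the forward axis are reported
    as "right", and points exactly beside the observer as "back".
    """
    # Forward vector of the observer and vector from the observer to the query.
    fx, fy = facing[0] - observer[0], facing[1] - observer[1]
    qx, qy = query[0] - observer[0], query[1] - observer[1]

    # The 2D cross product is positive when the query lies counter-clockwise
    # from the forward direction, which is the observer's left when z points up.
    cross = fx * qy - fy * qx
    # The dot product is positive when the query lies ahead of the observer.
    dot = fx * qx + fy * qy

    side = "left" if cross > 0 else "right"
    depth = "front" if dot > 0 else "back"
    return side, depth


# Observer at the origin facing +x; one object ahead-left, one behind-right.
print(relative_direction((0.0, 0.0), (1.0, 0.0), (1.0, 1.0)))
print(relative_direction((0.0, 0.0), (1.0, 0.0), (-1.0, -1.0)))
```

The point of the sketch is that the answer depends on the observer's frame, not only on world coordinates: swapping the facing target flips the result even though no object has moved.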

Problem

Research questions and friction points this paper is trying to address.

video reasoning
spatial reasoning
visual question answering
3D representation
physical-world understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial Code
3D Video Reasoning
Geometrically Grounded LLMs
6D Object Parsing
Reinforcement Learning for VQA