pySpatial: Generating 3D Visual Programs for Zero-Shot Spatial Reasoning

📅 2026-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of multimodal large language models in 3D spatial understanding by introducing the first zero-shot, fine-tuning-free framework for 3D spatial reasoning. The proposed method guides a multimodal large language model to generate executable Python vision programs that invoke specialized tools—including 3D reconstruction, camera pose estimation, and novel view synthesis—to transform a sequence of 2D images into an interactive 3D scene, thereby enabling explicit spatial reasoning. Evaluated on MindCube and Omni3D-Bench, the approach significantly outperforms existing baselines, achieving a 12.94% improvement over GPT-4.1-mini, and successfully generates feasible navigation paths in real-world indoor environments.

📝 Abstract
Multi-modal Large Language Models (MLLMs) have demonstrated strong capabilities in general-purpose perception and reasoning, but they still struggle with tasks that require spatial understanding of the 3D world. To address this, we introduce pySpatial, a visual programming framework that equips MLLMs with the ability to interface with spatial tools via Python code generation. Given an image sequence and a natural-language query, the model composes function calls to spatial tools such as 3D reconstruction, camera-pose recovery, and novel-view rendering. These operations convert raw 2D inputs into an explorable 3D scene, enabling MLLMs to reason explicitly over structured spatial representations. Notably, pySpatial requires no gradient-based fine-tuning and operates in a fully zero-shot setting. Experimental evaluations on the challenging MindCube and Omni3D-Bench benchmarks demonstrate that pySpatial consistently surpasses strong MLLM baselines; for instance, it outperforms GPT-4.1-mini by 12.94% on MindCube. Furthermore, in real-world indoor navigation experiments, a robot successfully traverses complex environments using route plans generated by pySpatial, highlighting the practical effectiveness of our approach.
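The pipeline the abstract describes — an MLLM emitting an executable Python program that chains spatial tools into a 3D scene representation — can be sketched as below. All tool names, signatures, and data structures here are illustrative placeholders standing in for the framework's actual API, which the abstract does not specify; the real tools would wrap heavyweight models (e.g. multi-view reconstruction and novel-view synthesis), stubbed out here with trivial implementations.

```python
# Illustrative sketch of a pySpatial-style visual program (hypothetical API).
from dataclasses import dataclass, field


@dataclass
class Scene:
    """Minimal stand-in for a reconstructed 3D scene."""
    num_images: int
    points: list = field(default_factory=list)


def reconstruct_scene(images):
    # Placeholder for a 3D reconstruction tool; a real tool would run
    # multi-view geometry over the input images.
    return Scene(num_images=len(images), points=[(0.0, 0.0, 0.0)] * 8)


def estimate_poses(scene):
    # Placeholder for camera-pose recovery: one dummy pose per input image,
    # as (translation, quaternion) pairs.
    return [((0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 1.0))
            for _ in range(scene.num_images)]


def render_view(scene, pose):
    # Placeholder for novel-view synthesis: returns a dummy rendering.
    return {"pose": pose, "num_points": len(scene.points)}


def visual_program(images, query):
    """The kind of program an MLLM might generate: compose the tools,
    then reason over the resulting structured representation."""
    scene = reconstruct_scene(images)
    poses = estimate_poses(scene)
    # Inspect the scene from the last camera's viewpoint.
    view = render_view(scene, poses[-1])
    return {"query": query, "num_views": len(poses), "rendered": view}


result = visual_program(
    images=["img_%d.png" % i for i in range(4)],
    query="Is the chair left of the table?",
)
print(result["num_views"])  # 4
```

The point of the design is that the 3D scene becomes an explicit, inspectable intermediate object: the model answers the query by querying that structure rather than by reasoning implicitly over raw pixels.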
Problem

Research questions and friction points this paper is trying to address.

spatial reasoning
3D understanding
multi-modal large language models
zero-shot learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual programming
zero-shot spatial reasoning
3D reconstruction
multi-modal LLMs
Python code generation