🤖 AI Summary
Existing 3D spatial reasoning methods suffer from insufficient geometric computation accuracy, while visual programming approaches rely on either fixed toolsets or inefficient inductive tool discovery. Method: This paper proposes a visual programming framework that dynamically constructs a reusable tool library grounded in problem-solving experience. It introduces a transductive tool-generation paradigm, free of prior assumptions about which tools will be needed, that combines vision-language-model-driven program synthesis, pattern abstraction, and exemplar-based feedback into a closed evolutionary loop of "experience accumulation → pattern distillation → tool refinement," so the tool library improves autonomously and incrementally during task solving. Contribution/Results: The framework generalizes strongly to unseen spatial tasks. On Omni3D-Bench it outperforms GPT-4o by 22% and surpasses the previous state of the art by 11%; its learned tools are invoked five times as often as those produced by inductive methods, and it attains SOTA on SpatialScore-Hard without any task-specific adaptation.
📝 Abstract
Spatial reasoning in 3D scenes requires precise geometric calculations that challenge vision-language models. Visual programming addresses this by decomposing problems into steps that call specialized tools, yet existing methods rely on either fixed toolsets or speculative tool induction before solving problems, resulting in suboptimal programs and poor utilization of induced tools. We present Transductive Visual Programming (TVP), a novel framework that builds new tools from its own experience rather than from speculation. TVP first solves problems using basic tools while accumulating experiential solutions into an Example Library, then abstracts recurring patterns from these programs into reusable higher-level tools for an evolving Tool Library. This allows TVP to tackle new problems with increasingly powerful tools learned from experience. On Omni3D-Bench, TVP achieves state-of-the-art performance, outperforming GPT-4o by 22% and the previous best visual programming system by 11%. Our transductively learned tools are used 5x more frequently as core program dependencies than inductively created ones, demonstrating more effective tool discovery and reuse. The evolved tools also generalize strongly to unseen spatial tasks, achieving superior performance on benchmarks from the SpatialScore-Hard collection without any test-set-specific modification. Our work establishes experience-driven transductive tool creation as a powerful paradigm for building self-evolving visual programming agents that effectively tackle challenging spatial reasoning tasks. We release our code at https://transductive-visualprogram.github.io/.
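The loop described above (solve with basic tools, accumulate solved programs, abstract recurring patterns into new tools) can be illustrated with a minimal toy sketch. All names here are hypothetical, programs are reduced to sequences of tool-call names, and n-gram counting stands in for the paper's actual VLM-driven pattern abstraction; this is an illustration of the control flow, not the authors' implementation.

```python
# Toy sketch of an experience-driven tool-abstraction loop.
# Hypothetical tool names; n-gram counting substitutes for VLM-based
# pattern abstraction used in the actual TVP framework.
from collections import Counter


class ToolLibrary:
    def __init__(self):
        # Start from basic tools only; higher-level tools are added later.
        self.tools = {"detect", "depth", "project", "measure"}

    def add(self, name):
        self.tools.add(name)


def ngrams(seq, n):
    """All contiguous call subsequences of length n in one program."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def abstract_patterns(example_library, min_count=2, n=2):
    """Distill call sequences that recur across solved programs."""
    counts = Counter()
    for program in example_library:
        counts.update(set(ngrams(program, n)))  # count once per program
    return [p for p, c in counts.items() if c >= min_count]


# Simulated "Example Library": each solved problem is a tool-call sequence.
examples = [
    ["detect", "depth", "project", "measure"],
    ["detect", "depth", "project"],
    ["detect", "measure"],
]

tools = ToolLibrary()
for pattern in abstract_patterns(examples):
    # Register each recurring pattern as a composite higher-level tool.
    tools.add("+".join(pattern))
```

Here the pairs ("detect", "depth") and ("depth", "project") each occur in two programs, so they are promoted to composite tools, while one-off sequences are not; subsequent problems could then be solved against the enlarged library, closing the loop.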