PyVision: Agentic Vision with Dynamic Tooling

📅 2025-07-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing visual reasoning approaches are constrained by predefined workflows and static toolsets, limiting flexibility and interpretability. This paper proposes a multi-round interactive framework that enables multimodal large language models (MLLMs) to autonomously generate, execute, and iteratively refine Python-based tools tailored to visual tasks. Its core innovation is a dynamic tool-generation mechanism: rather than relying on a fixed tool library, the model synthesizes executable code tools on the fly according to task requirements, with execution feedback driving iterative refinement across multiple rounds. Integrating multimodal perception, program synthesis, and closed-loop execution, the framework establishes the first evolvable visual tool-invocation system. Experiments demonstrate substantial performance gains across multiple benchmarks: +7.8% on V* for GPT-4.1 and +31.1% on VLMsAreBlind-mini for Claude-4.0-Sonnet, validating both effectiveness and generalizability.
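The multi-turn generate–execute–refine loop described above can be sketched in a few lines of Python. This is a hedged illustration, not the authors' implementation: `call_mllm`, the `ANSWER:` convention, and the prompt layout are all assumptions introduced here for clarity.

```python
# Minimal sketch of a PyVision-style agent loop (hypothetical API;
# call_mllm and the ANSWER:/FEEDBACK: conventions are assumptions).
import io
import contextlib

def run_tool(code: str, env: dict) -> str:
    """Execute model-generated Python, returning captured stdout or the error."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, env)  # a real system would sandbox this execution
        return buf.getvalue()
    except Exception as e:
        return f"ERROR: {e!r}"

def agent_loop(call_mllm, task: str, max_turns: int = 4) -> str:
    """Each turn the MLLM emits either code or a final answer; execution
    feedback is appended to the context so the next turn can refine the tool."""
    context = [f"TASK: {task}"]
    env = {}  # namespace persists across turns, so tools can build on each other
    for _ in range(max_turns):
        reply = call_mllm("\n".join(context))
        if reply.startswith("ANSWER:"):
            return reply[len("ANSWER:"):].strip()
        feedback = run_tool(reply, env)
        context.append(f"CODE:\n{reply}\nFEEDBACK:\n{feedback}")
    return "no answer"
```

The key design point this sketch captures is the closed loop: the model's code output is executed, and both the code and its result (including any error) are fed back as context, which is what lets the model repair a failing tool on the next round.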

📝 Abstract
LLMs are increasingly deployed as agents, systems capable of planning, reasoning, and dynamically calling external tools. However, in visual reasoning, prior approaches largely remain limited by predefined workflows and static toolsets. In this report, we present PyVision, an interactive, multi-turn framework that enables MLLMs to autonomously generate, execute, and refine Python-based tools tailored to the task at hand, unlocking flexible and interpretable problem-solving. We develop a taxonomy of the tools created by PyVision and analyze their usage across a diverse set of benchmarks. Quantitatively, PyVision achieves consistent performance gains, boosting GPT-4.1 by +7.8% on V* and Claude-4.0-Sonnet by +31.1% on VLMsAreBlind-mini. These results point to a broader shift: dynamic tooling allows models not just to use tools, but to invent them, advancing toward more agentic visual reasoning.
Problem

Research questions and friction points this paper is trying to address.

Enabling MLLMs to dynamically generate and refine Python-based tools
Overcoming limitations of predefined workflows in visual reasoning
Advancing agentic visual reasoning through dynamic tool invention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Python-based tool generation
Interactive multi-turn vision framework
Autonomous tool refinement for tasks
Authors
Shitian Zhao, Shanghai AI Lab
Haoquan Zhang, SphereLab, CUHK
Shaoheng Lin, Shanghai AI Lab
Ming Li, Shanghai AI Lab
Qilong Wu, NUS
Kaipeng Zhang, Shanghai AI Laboratory
Chen Wei, Rice University