🤖 AI Summary
Existing visual reasoning approaches are constrained by predefined workflows and static toolsets, limiting flexibility and interpretability. This paper proposes a multi-round interactive framework that enables multimodal large language models (MLLMs) to autonomously generate, execute, and iteratively refine Python-based tools tailored to visual tasks. Its core innovation is a dynamic tool generation mechanism: rather than relying on a fixed tool library, the model synthesizes executable code tools on the fly according to task requirements, with execution feedback driving iterative refinement across multiple rounds. By integrating multimodal perception, program synthesis, and closed-loop execution, the framework establishes the first evolvable visual tool invocation system. Experiments demonstrate substantial performance gains across multiple benchmarks: +7.8% on V* for GPT-4.1 and +31.1% on VLMsAreBlind-mini for Claude-4.0-Sonnet, validating both effectiveness and generalizability.
📝 Abstract
LLMs are increasingly deployed as agents: systems capable of planning, reasoning, and dynamically calling external tools. In visual reasoning, however, prior approaches remain largely limited by predefined workflows and static toolsets. In this report, we present PyVision, an interactive, multi-turn framework that enables MLLMs to autonomously generate, execute, and refine Python-based tools tailored to the task at hand, unlocking flexible and interpretable problem-solving. We develop a taxonomy of the tools created by PyVision and analyze their usage across a diverse set of benchmarks. Quantitatively, PyVision achieves consistent performance gains, boosting GPT-4.1 by +7.8% on V* and Claude-4.0-Sonnet by +31.1% on VLMsAreBlind-mini. These results point to a broader shift: dynamic tooling allows models not just to use tools but to invent them, advancing toward more agentic visual reasoning.
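The multi-turn generate-execute-refine loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `generate` callable stands in for the MLLM, the convention that a tool reports its value through a `result` variable is an assumption, and `fake_model` is a toy stand-in for a real multimodal model.

```python
import traceback

def run_tool(code: str, env: dict) -> str:
    """Execute model-generated Python in a shared namespace; return output or error text.
    (Convention assumed here: the tool stores its value in env["result"].)"""
    try:
        exec(code, env)
        return f"OK: {env.get('result')}"
    except Exception:
        # Errors are not fatal: the traceback is fed back so the model can refine the tool.
        return "ERROR:\n" + traceback.format_exc()

def agent_loop(question: str, generate, max_turns: int = 3) -> str:
    """Multi-turn loop: the model proposes code, we execute it, and the execution
    feedback re-enters the context to drive the next round of refinement."""
    env: dict = {}                            # namespace persists across turns
    history = [("user", question)]
    for _ in range(max_turns):
        action = generate(history)            # hypothetical MLLM call: returns
        if "answer" in action:                #   {"code": ...} or {"answer": ...}
            return action["answer"]
        feedback = run_tool(action["code"], env)
        history.append(("tool", feedback))    # closed-loop feedback
    return "no answer within turn budget"

# Toy stand-in for the model: first turn synthesizes a counting tool, second turn answers.
def fake_model(history):
    role, content = history[-1]
    if role == "user":
        return {"code": "result = sum(1 for c in 'visual reasoning' if c == 's')"}
    return {"answer": f"the tool reported: {content}"}

print(agent_loop("How many 's' characters?", fake_model))
```

A real deployment would sandbox `exec` (the paper's setting involves untrusted model-written code) and pass images alongside text in `history`; both are omitted here for brevity.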