🤖 AI Summary
Existing visual reasoning approaches are constrained by predefined workflows and static toolsets, limiting flexibility and interpretability. This paper proposes a multi-round interactive framework that enables multimodal large language models (MLLMs) to autonomously generate, execute, and iteratively refine Python-based tools tailored to visual tasks. Its core innovation is a dynamic tool generation mechanism: rather than relying on a fixed tool library, the model synthesizes executable code tools on the fly according to task requirements, with execution feedback driving iterative refinement across multiple rounds. By integrating multimodal perception, program synthesis, and closed-loop execution, the framework establishes the first evolvable visual tool invocation system. Experiments demonstrate substantial performance gains across multiple benchmarks: +7.8% on V* for GPT-4.1 and +31.1% on VLMsAreBlind-mini for Claude-4.0-Sonnet, validating both effectiveness and generalizability.
📝 Abstract
LLMs are increasingly deployed as agents: systems capable of planning, reasoning, and dynamically calling external tools. In visual reasoning, however, prior approaches remain largely limited by predefined workflows and static toolsets. In this report, we present PyVision, an interactive, multi-turn framework that enables MLLMs to autonomously generate, execute, and refine Python-based tools tailored to the task at hand, unlocking flexible and interpretable problem-solving. We develop a taxonomy of the tools created by PyVision and analyze their usage across a diverse set of benchmarks. Quantitatively, PyVision achieves consistent performance gains, boosting GPT-4.1 by +7.8% on V* and Claude-4.0-Sonnet by +31.1% on VLMsAreBlind-mini. These results point to a broader shift: dynamic tooling allows models not just to use tools but to invent them, advancing toward more agentic visual reasoning.
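The multi-turn generate-execute-refine loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `generate` callable stands in for the MLLM, the convention that a tool reports its value through a `result` variable is an assumption, and `fake_model` is a toy stand-in for a real multimodal model.

```python
import traceback

def run_tool(code: str, env: dict) -> str:
    """Execute model-generated Python in a shared namespace; return output or error text.
    (Convention assumed here: the tool stores its value in env["result"].)"""
    try:
        exec(code, env)
        return f"OK: {env.get('result')}"
    except Exception:
        # Errors are not fatal: the traceback is fed back so the model can refine the tool.
        return "ERROR:\n" + traceback.format_exc()

def agent_loop(question: str, generate, max_turns: int = 3) -> str:
    """Multi-turn loop: the model proposes code, we execute it, and the execution
    feedback re-enters the context to drive the next round of refinement."""
    env: dict = {}                            # namespace persists across turns
    history = [("user", question)]
    for _ in range(max_turns):
        action = generate(history)            # hypothetical MLLM call: returns
        if "answer" in action:                #   {"code": ...} or {"answer": ...}
            return action["answer"]
        feedback = run_tool(action["code"], env)
        history.append(("tool", feedback))    # closed-loop feedback
    return "no answer within turn budget"

# Toy stand-in for the model: first turn synthesizes a counting tool, second turn answers.
def fake_model(history):
    role, content = history[-1]
    if role == "user":
        return {"code": "result = sum(1 for c in 'visual reasoning' if c == 's')"}
    return {"answer": f"the tool reported: {content}"}

print(agent_loop("How many 's' characters?", fake_model))
```

A real deployment would sandbox `exec` (the paper's setting involves untrusted model-written code) and pass images alongside text in `history`; both are omitted here for brevity.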