VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods for tool-augmented visual reasoning rely on either training-free prompting or large-scale fine-tuning. Both lack active tool exploration and typically assume limited tool diversity, and fine-tuning additionally demands extensive human supervision. This paper proposes VisTA, a reinforcement learning framework in which a visual agent autonomously explores, selects, and combines tools from a diverse library, using task outcomes as feedback. Its core contribution is the end-to-end training of query-specific tool-selection strategies with Group Relative Policy Optimization (GRPO), which requires no explicit reasoning supervision. The framework integrates a multi-stage visual reasoning architecture with a dynamic tool library invocation mechanism. Evaluated on ChartQA, Geometry3K, and BlindTest, VisTA substantially outperforms training-free baselines, especially on out-of-distribution examples, indicating better generalization and more adaptive tool use.

📝 Abstract
We introduce VisTA, a new reinforcement learning framework that empowers visual agents to dynamically explore, select, and combine tools from a diverse library based on empirical performance. Existing methods for tool-augmented reasoning either rely on training-free prompting or large-scale fine-tuning; both lack active tool exploration and typically assume limited tool diversity, and fine-tuning methods additionally demand extensive human supervision. In contrast, VisTA leverages end-to-end reinforcement learning to iteratively refine sophisticated, query-specific tool selection strategies, using task outcomes as feedback signals. Through Group Relative Policy Optimization (GRPO), our framework enables an agent to autonomously discover effective tool-selection pathways without requiring explicit reasoning supervision. Experiments on the ChartQA, Geometry3K, and BlindTest benchmarks demonstrate that VisTA achieves substantial performance gains over training-free baselines, especially on out-of-distribution examples. These results highlight VisTA's ability to enhance generalization, adaptively utilize diverse tools, and pave the way for flexible, experience-driven visual reasoning systems.
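The abstract names GRPO but does not spell out how outcome feedback becomes a learning signal. As GRPO is commonly described, several responses are sampled for the same query and each reward is normalized against the group's own mean and standard deviation, so no separate value model is needed. The sketch below illustrates only that normalization step; the function name, the `eps` constant, and the example rewards are assumptions made for illustration and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's mean and std.

    `eps` is an assumed small constant that avoids division by zero
    when every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the sampled group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical outcome rewards for four tool selections sampled on one query:
# 1.0 means the final answer was correct, 0.0 means it was not.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Selections that beat the group average receive a positive advantage and the others a negative one. When all samples earn the same reward, every advantage is zero and that group contributes no update.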
Problem

Research questions and friction points this paper is trying to address.

Dynamic tool selection for visual agents using reinforcement learning
Overcoming limitations of training-free and fine-tuning methods in tool-augmented reasoning
Enhancing generalization and adaptive tool use in visual reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning for dynamic tool selection
Group Relative Policy Optimization (GRPO)
End-to-end iterative tool strategy refinement
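To make the three points above concrete, here is a toy sketch of outcome-driven tool selection. It is not the paper's implementation: the tool names are hypothetical, the reward is simulated (tool 0 is assumed to be the only one that yields a correct answer), the policy is a bare softmax over three logits with no query conditioning, and the update is a plain REINFORCE-style step with group-relative advantages, without the clipping or KL terms usually associated with GRPO.

```python
import math
import random

TOOLS = ["chart_to_table", "ocr", "captioner"]  # hypothetical tool names


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def simulated_reward(tool_index):
    # Assumption: only tool 0 leads to a correct final answer.
    return 1.0 if tool_index == 0 else 0.0


def train(steps=200, group_size=8, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(TOOLS)  # start from a uniform policy
    for _ in range(steps):
        probs = softmax(logits)
        # Sample a group of tool selections for the same (implicit) query.
        picks = rng.choices(range(len(TOOLS)), weights=probs, k=group_size)
        rewards = [simulated_reward(a) for a in picks]
        mu = sum(rewards) / group_size
        sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / group_size)
        advantages = [(r - mu) / (sigma + 1e-6) for r in rewards]
        # d log pi(a) / d logit_j = 1[j == a] - probs[j]
        for j in range(len(TOOLS)):
            grad = sum(
                adv * ((1.0 if j == a else 0.0) - probs[j])
                for a, adv in zip(picks, advantages)
            ) / group_size
            logits[j] += lr * grad
    return softmax(logits)


final_probs = train()
```

Because the only feedback is whether the outcome was correct, the policy shifts probability toward the tool that works without ever being told which tool to use. A real system would condition the policy on the query and image and would obtain rewards by running the selected tools and a reasoning model.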
Zeyi Huang
University of Wisconsin-Madison
Yuyang Ji
Drexel
Computer vision, Vision Large Language Model
Anirudh Sundara Rajan
University of Wisconsin-Madison
Zefan Cai
Student, Peking University
Inference Acceleration, Multi-Modality
Wen Xiao
Microsoft
Junjie Hu
University of Wisconsin-Madison
Yong Jae Lee
University of Wisconsin-Madison