Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage

📅 2024-12-20
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address weak tool-use capability and incoherent multi-step reasoning in language-model agents, this paper proposes T3-Agent, a multimodal agent designed for tool invocation. Methodologically, the authors introduce a multimodal trajectory synthesis pipeline that integrates GPT-4o mini with a two-stage verification mechanism, at the query level and the trajectory level, to construct MM-Traj, a high-quality multimodal trajectory dataset comprising 20K multi-step tasks. They further propose Trajectory Tuning, an end-to-end fine-tuning paradigm that turns vision-language models (VLMs) such as MiniCPM-V-8.5B and Qwen2-VL-7B into decision controllers. On the GTA and GAIA benchmarks, T3-Agent outperforms the untrained VLMs by 20%, markedly strengthening the cross-modal perception, planning, and execution loop. The work establishes a scalable data curation and training framework for multi-step tool-using agents.

📝 Abstract
The advancement of large language models (LLMs) prompts the development of multi-modal agents, which are used as controllers to call external tools, providing a feasible way to solve practical tasks. In this paper, we propose a multi-modal agent tuning method that automatically generates multi-modal tool-usage data and tunes a vision-language model (VLM) as the controller for powerful tool-usage reasoning. To preserve data quality, we prompt the GPT-4o mini model to generate queries, files, and trajectories, followed by query-file and trajectory verifiers. Based on this data synthesis pipeline, we collect the MM-Traj dataset, which contains 20K tasks with tool-usage trajectories. We then develop the T3-Agent via Trajectory Tuning on VLMs for Tool usage using MM-Traj. Evaluations on the GTA and GAIA benchmarks show that the T3-Agent consistently achieves improvements on two popular VLMs, MiniCPM-V-8.5B and Qwen2-VL-7B, outperforming untrained VLMs by 20%. This demonstrates the effectiveness of the proposed data synthesis pipeline in producing high-quality data for tool-usage capabilities.
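The two-stage filtering described in the abstract (generate candidate tasks, then keep only those that pass a query-file check and a trajectory check) can be sketched as a simple curation loop. This is a minimal illustration, not the authors' implementation: the `Trajectory` class, the rule-based `query_file_verifier` and `trajectory_verifier` heuristics, and `curate` are all hypothetical stand-ins for the paper's GPT-4o mini based verifiers.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """A hypothetical candidate training example: a query, its input files,
    and a multi-step tool-usage trajectory ending in a final answer."""
    query: str
    files: list[str] = field(default_factory=list)
    steps: list[str] = field(default_factory=list)


def query_file_verifier(query: str, files: list[str]) -> bool:
    # Stand-in for the query-file verifier: the query must be non-trivial
    # and actually reference each file it is paired with (by base name).
    return len(query.split()) >= 5 and all(
        f.rsplit(".", 1)[0] in query for f in files
    )


def trajectory_verifier(traj: Trajectory) -> bool:
    # Stand-in for the trajectory verifier: require genuinely multi-step
    # reasoning that terminates in an explicit final answer.
    return len(traj.steps) >= 2 and traj.steps[-1].startswith("final_answer")


def curate(candidates: list[Trajectory]) -> list[Trajectory]:
    # Two-stage filtering: query-level check first, then trajectory-level.
    return [
        t for t in candidates
        if query_file_verifier(t.query, t.files) and trajectory_verifier(t)
    ]
```

In the paper the surviving trajectories form MM-Traj, which is then used to fine-tune the VLM controller; here the verifiers are cheap heuristics only to show the pipeline shape.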
Problem

Research questions and friction points this paper is trying to address.

Language Models
Tool Usage
Task Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Model (VLM)
Tool Usage Learning
T3-Agent
Zhi Gao
School of Intelligence Science and Technology, Peking University; State Key Laboratory of General Artificial Intelligence, BIGAI
Bofei Zhang
BIGAI
Pengxiang Li
Beijing Institute of Technology
Multimodal Agent, Vision and Language, 3DV, Hyperbolic Learning
Xiaojian Ma
University of California, Los Angeles
Computer Vision, Machine Learning, Generative Modeling, Reinforcement Learning
Tao Yuan
University of California, Los Angeles
Computer Vision, Artificial Intelligence
Yue Fan
State Key Laboratory of General Artificial Intelligence, BIGAI
Yuwei Wu
Ph.D. candidate, GRASP Lab, University of Pennsylvania
Robotics, Trajectory Optimization, Task and Motion Planning
Yunde Jia
Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University; Beijing Key Laboratory of Intelligent Information Technology, Beijing Institute of Technology
Song-Chun Zhu
School of Intelligence Science and Technology, Peking University; State Key Laboratory of General Artificial Intelligence, BIGAI; Department of Automation, Tsinghua University
Qing Li
State Key Laboratory of General Artificial Intelligence, BIGAI