🤖 AI Summary
To address weak tool-use capability and incoherent multi-step reasoning in language model agents, this paper proposes T3-Agent, a multimodal agent designed for tool invocation. Methodologically, the authors introduce a multimodal trajectory synthesis pipeline that pairs GPT-4o mini generation with a two-stage verification mechanism (query-file and trajectory verifiers) to construct MM-Traj, a high-quality multimodal trajectory dataset of 20K multi-step tasks. They further propose Trajectory Tuning, an end-to-end fine-tuning paradigm that trains vision-language models (VLMs) such as MiniCPM-V-8.5B and Qwen2-VL-7B as agent controllers. On the GTA and GAIA benchmarks, T3-Agent outperforms the untrained baseline VLMs by 20%, strengthening the cross-modal perception–planning–execution loop. This work establishes a scalable data curation and training framework for multi-step tool-using agents.
📝 Abstract
The advancement of large language models (LLMs) has driven the development of multi-modal agents, which use a model as a controller to call external tools, providing a feasible way to solve practical tasks. In this paper, we propose a multi-modal agent tuning method that automatically generates multi-modal tool-usage data and tunes a vision-language model (VLM) as the controller for powerful tool-usage reasoning. To ensure data quality, we prompt the GPT-4o mini model to generate queries, files, and trajectories, then filter them with query-file and trajectory verifiers. Based on this data synthesis pipeline, we collect the MM-Traj dataset, which contains 20K tasks with tool-usage trajectories. We then develop the T3-Agent via **T**rajectory **T**uning on VLMs for **T**ool usage using MM-Traj. Evaluations on the GTA and GAIA benchmarks show that the T3-Agent consistently achieves improvements on two popular VLMs, MiniCPM-V-8.5B and Qwen2-VL-7B, outperforming untrained VLMs by 20%. These results demonstrate the effectiveness of the proposed data synthesis pipeline in producing high-quality data for tool-usage capabilities.
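The two-stage filtering described above (generate candidate query/file/trajectory triples, then keep only those passing a query-file check and a trajectory check) can be sketched as follows. This is a hypothetical illustration: the `Candidate` structure and the toy verifier rules are assumptions for the sketch, not the paper's actual GPT-4o mini prompts or verifier models.

```python
# Hypothetical sketch of a two-stage verified data pipeline: candidate
# (query, files, trajectory) triples must pass a query-file consistency
# check and a trajectory check before entering the dataset. The rules
# below are toy stand-ins for the paper's model-based verifiers.
from dataclasses import dataclass, field

@dataclass
class Candidate:
    query: str                       # natural-language task
    files: list = field(default_factory=list)       # attached file names
    trajectory: list = field(default_factory=list)  # ordered tool calls

def verify_query_file(c: Candidate) -> bool:
    # Toy rule: the task must come with files, and every attached file
    # must actually be referenced in the query text.
    return len(c.files) > 0 and all(f in c.query for f in c.files)

def verify_trajectory(c: Candidate) -> bool:
    # Toy rule: a valid trajectory is multi-step and ends in a final answer.
    return len(c.trajectory) >= 2 and c.trajectory[-1] == "answer"

def build_dataset(candidates):
    """Keep only candidates that pass both verification stages."""
    return [c for c in candidates if verify_query_file(c) and verify_trajectory(c)]

candidates = [
    Candidate("What trend does chart.png show?",
              ["chart.png"], ["load_image", "ocr", "answer"]),
    Candidate("Summarize the report", [], ["answer"]),  # fails both checks
]
kept = build_dataset(candidates)
print(len(kept))  # 1
```

In the paper's pipeline both verifiers are model-based rather than rule-based, but the control flow is the same: rejection at either stage removes the candidate before it reaches MM-Traj.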