🤖 AI Summary
To address weak tool-use capability and incoherent multi-step reasoning in language model agents, this paper proposes T3-Agent, a multimodal agent designed for tool invocation. Methodologically, the authors introduce a multimodal trajectory synthesis pipeline that pairs GPT-4o mini generation with a two-stage verification mechanism (query-file and trajectory verifiers) to construct MM-Traj, a high-quality multimodal trajectory dataset of 20K multi-step tasks. They further propose Trajectory Tuning, an end-to-end fine-tuning paradigm that trains vision-language models (VLMs) such as MiniCPM-V-8.5B and Qwen2-VL-7B as agent controllers. On the GTA and GAIA benchmarks, T3-Agent outperforms the untrained baseline VLMs by 20%, strengthening the cross-modal perception–planning–execution loop. This work establishes a scalable data curation and training framework for multi-step tool-using agents.
📝 Abstract
The advancement of large language models (LLMs) has driven the development of multi-modal agents, which use a model as a controller to call external tools, providing a feasible way to solve practical tasks. In this paper, we propose a multi-modal agent tuning method that automatically generates multi-modal tool-usage data and tunes a vision-language model (VLM) as the controller for powerful tool-usage reasoning. To ensure data quality, we prompt the GPT-4o mini model to generate queries, files, and trajectories, then filter them with query-file and trajectory verifiers. Based on this data synthesis pipeline, we collect the MM-Traj dataset, which contains 20K tasks with tool-usage trajectories. We then develop the T3-Agent via **T**rajectory **T**uning on VLMs for **T**ool usage using MM-Traj. Evaluations on the GTA and GAIA benchmarks show that the T3-Agent consistently achieves improvements on two popular VLMs, MiniCPM-V-8.5B and Qwen2-VL-7B, outperforming untrained VLMs by 20%. These results demonstrate the effectiveness of the proposed data synthesis pipeline in producing high-quality data for tool-usage capabilities.
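The two-stage filtering described above (generate candidate query/file/trajectory triples, then keep only those passing a query-file check and a trajectory check) can be sketched as follows. This is a hypothetical illustration: the `Candidate` structure and the toy verifier rules are assumptions for the sketch, not the paper's actual GPT-4o mini prompts or verifier models.

```python
# Hypothetical sketch of a two-stage verified data pipeline: candidate
# (query, files, trajectory) triples must pass a query-file consistency
# check and a trajectory check before entering the dataset. The rules
# below are toy stand-ins for the paper's model-based verifiers.
from dataclasses import dataclass, field

@dataclass
class Candidate:
    query: str                       # natural-language task
    files: list = field(default_factory=list)       # attached file names
    trajectory: list = field(default_factory=list)  # ordered tool calls

def verify_query_file(c: Candidate) -> bool:
    # Toy rule: the task must come with files, and every attached file
    # must actually be referenced in the query text.
    return len(c.files) > 0 and all(f in c.query for f in c.files)

def verify_trajectory(c: Candidate) -> bool:
    # Toy rule: a valid trajectory is multi-step and ends in a final answer.
    return len(c.trajectory) >= 2 and c.trajectory[-1] == "answer"

def build_dataset(candidates):
    """Keep only candidates that pass both verification stages."""
    return [c for c in candidates if verify_query_file(c) and verify_trajectory(c)]

candidates = [
    Candidate("What trend does chart.png show?",
              ["chart.png"], ["load_image", "ocr", "answer"]),
    Candidate("Summarize the report", [], ["answer"]),  # fails both checks
]
kept = build_dataset(candidates)
print(len(kept))  # 1
```

In the paper's pipeline both verifiers are model-based rather than rule-based, but the control flow is the same: rejection at either stage removes the candidate before it reaches MM-Traj.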