MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity of high-quality multimodal agent trajectories and the prohibitive cost of manual annotation, this paper proposes a vision-centric fine-tuning framework for Vision-Language Models (VLMs) as autonomous agents. The method introduces three key innovations: (1) M-TRACE, a large-scale, diverse multimodal task dataset with verified trajectories; (2) Pref-X, an automated pipeline for synthesizing fine-grained, scalable multimodal preference pairs; and (3) an end-to-end optimization strategy integrating trajectory synthesis, behavioral cloning, and step-wise preference learning to jointly refine the VLM controller. Evaluated on three challenging benchmarks (Agent-X, GTA, and GAIA), the approach achieves state-of-the-art performance, significantly outperforming both leading open-source and proprietary VLMs in tool-use accuracy and cross-task robustness.

📝 Abstract
Vision language models (VLMs) are increasingly deployed as controllers with access to external tools for complex reasoning and decision-making, yet their effectiveness remains limited by the scarcity of high-quality multimodal trajectories and the cost of manual annotation. We address this challenge with a vision-centric agent tuning framework that automatically synthesizes multimodal trajectories, generates step-wise preference pairs, and trains a VLM controller for robust tool-use reasoning. Our pipeline first constructs M-TRACE, a large-scale dataset of 28.5K multimodal tasks with 177K verified trajectories, enabling imitation-based trajectory tuning. Building on this, we develop MATRIX Agent, a controller finetuned on M-TRACE for step-wise tool reasoning. To achieve finer alignment, we further introduce Pref-X, a set of 11K automatically generated preference pairs, and optimize MATRIX on it via step-wise preference learning. Across three benchmarks, Agent-X, GTA, and GAIA, MATRIX consistently surpasses both open- and closed-source VLMs, demonstrating scalable and effective multimodal tool use. Our data and code are available at https://github.com/mbzuai-oryx/MATRIX.
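
The step-wise preference learning stage is described here only at a high level; a common way to realize such an objective is a DPO-style loss applied to each reasoning step rather than to whole trajectories. The sketch below is a minimal illustration under that assumption: the function name, the `beta` temperature, and the reference-model setup are hypothetical placeholders, not the paper's released implementation.

```python
# Minimal sketch of a DPO-style step-wise preference loss, assuming Pref-X-like
# pairs of (chosen step, rejected step) sharing the same multimodal context.
# All names here are illustrative, not from the MATRIX codebase.
import torch
import torch.nn.functional as F

def stepwise_preference_loss(policy_logp_chosen: torch.Tensor,
                             policy_logp_rejected: torch.Tensor,
                             ref_logp_chosen: torch.Tensor,
                             ref_logp_rejected: torch.Tensor,
                             beta: float = 0.1) -> torch.Tensor:
    """Each argument is a batch of summed log-probabilities of a single
    tool-use step under the trainable policy or a frozen reference VLM."""
    policy_margin = policy_logp_chosen - policy_logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    # Push the policy to prefer the chosen step more strongly than the
    # reference model does, step by step rather than per trajectory.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```

Optimizing this per-step margin, rather than a sequence-level reward, is what gives the finer alignment the abstract refers to as step-wise preference learning.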
Problem

Research questions and friction points this paper is trying to address.

Addresses scarcity of high-quality multimodal trajectories for tool-use reasoning
Automatically synthesizes multimodal trajectories and generates preference pairs
Trains VLM controllers for robust tool-use reasoning across benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatically synthesizes multimodal trajectories for training
Generates step-wise preference pairs for finer alignment
Trains the VLM controller via imitation and step-wise preference learning (a minimal sketch of the imitation stage follows below)
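
The imitation stage can be read as standard supervised fine-tuning (behavioral cloning) on the verified M-TRACE trajectories, followed by the preference stage sketched above. The snippet below illustrates that reading; the batch fields, the `tokenize` helper, and the HuggingFace-style `labels` interface are assumptions for illustration, not the authors' code.

```python
# Sketch of one behavioral-cloning step on a batch of verified trajectories.
# Each record is assumed to hold an image, a task prompt, and the expert's
# next tool-call / reasoning step as text. Names are placeholders.
import torch

def imitation_step(vlm, tokenize, batch, optimizer):
    """Maximize the likelihood of expert tool-use steps (cross-entropy)."""
    inputs = tokenize(images=batch["images"], prompts=batch["prompts"])
    labels = tokenize(text=batch["target_steps"])["input_ids"]
    outputs = vlm(**inputs, labels=labels)  # assumes an HF-style forward pass
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```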
Tajamul Ashraf
IIT Delhi, MBZUAI
Computer Vision, Deep Learning
Umair Nawaz
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), United Arab Emirates
Abdelrahman M. Shaker
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), United Arab Emirates
R. Anwer
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), United Arab Emirates
Philip Torr
Professor, University of Oxford
Department of Engineering
Fahad Shahbaz Khan
MBZUAI, Linköping University, Sweden
Computer Vision, Object Recognition, Generative AI, AI for Science
Salman Khan
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), United Arab Emirates