MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity of high-quality multimodal agent trajectories and the prohibitive cost of manual annotation, this paper proposes a vision-centric fine-tuning framework for Vision-Language Models (VLMs) as autonomous agents. The method introduces three key innovations: (1) M-TRACE, a large-scale, diverse multimodal task dataset with verified trajectories; (2) Pref-X, an automated pipeline for synthesizing fine-grained, scalable multimodal preference pairs; and (3) an end-to-end optimization strategy integrating trajectory synthesis, behavioral cloning, and step-wise preference learning to jointly refine the VLM controller. Evaluated on three challenging benchmarks (Agent-X, GTA, and GAIA), the approach achieves state-of-the-art performance, significantly outperforming both leading open-source and proprietary VLMs in tool-use accuracy and cross-task robustness.

📝 Abstract
Vision language models (VLMs) are increasingly deployed as controllers with access to external tools for complex reasoning and decision-making, yet their effectiveness remains limited by the scarcity of high-quality multimodal trajectories and the cost of manual annotation. We address this challenge with a vision-centric agent tuning framework that automatically synthesizes multimodal trajectories, generates step-wise preference pairs, and trains a VLM controller for robust tool-use reasoning. Our pipeline first constructs M-TRACE, a large-scale dataset of 28.5K multimodal tasks with 177K verified trajectories, enabling imitation-based trajectory tuning. Building on this, we develop MATRIX Agent, a controller finetuned on M-TRACE for step-wise tool reasoning. To achieve finer alignment, we further introduce Pref-X, a set of 11K automatically generated preference pairs, and optimize MATRIX on it via step-wise preference learning. Across three benchmarks, Agent-X, GTA, and GAIA, MATRIX consistently surpasses both open- and closed-source VLMs, demonstrating scalable and effective multimodal tool use. Our data and code are available at https://github.com/mbzuai-oryx/MATRIX.
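
The step-wise preference learning stage is described here only at a high level; a common way to realize such an objective is a DPO-style loss applied to each reasoning step rather than to whole trajectories. The sketch below is a minimal illustration under that assumption: the function name, the `beta` temperature, and the reference-model setup are hypothetical placeholders, not the paper's released implementation.

```python
# Minimal sketch of a DPO-style step-wise preference loss, assuming Pref-X-like
# pairs of (chosen step, rejected step) sharing the same multimodal context.
# All names here are illustrative, not from the MATRIX codebase.
import torch
import torch.nn.functional as F

def stepwise_preference_loss(policy_logp_chosen: torch.Tensor,
                             policy_logp_rejected: torch.Tensor,
                             ref_logp_chosen: torch.Tensor,
                             ref_logp_rejected: torch.Tensor,
                             beta: float = 0.1) -> torch.Tensor:
    """Each argument is a batch of summed log-probabilities of a single
    tool-use step under the trainable policy or a frozen reference VLM."""
    policy_margin = policy_logp_chosen - policy_logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    # Push the policy to prefer the chosen step more strongly than the
    # reference model does, step by step rather than per trajectory.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```

Optimizing this per-step margin, rather than a sequence-level reward, is what gives the finer alignment the abstract refers to as step-wise preference learning.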
Problem

Research questions and friction points this paper is trying to address.

Addresses scarcity of high-quality multimodal trajectories for tool-use reasoning
Automatically synthesizes multimodal trajectories and generates preference pairs
Trains VLM controllers for robust tool-use reasoning across benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatically synthesizes multimodal trajectories for training
Generates step-wise preference pairs for finer alignment
Trains the VLM controller via imitation and step-wise preference learning (a minimal sketch of the imitation stage follows below)
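
The imitation stage can be read as standard supervised fine-tuning (behavioral cloning) on the verified M-TRACE trajectories, followed by the preference stage sketched above. The snippet below illustrates that reading; the batch fields, the `tokenize` helper, and the HuggingFace-style `labels` interface are assumptions for illustration, not the authors' code.

```python
# Sketch of one behavioral-cloning step on a batch of verified trajectories.
# Each record is assumed to hold an image, a task prompt, and the expert's
# next tool-call / reasoning step as text. Names are placeholders.
import torch

def imitation_step(vlm, tokenize, batch, optimizer):
    """Maximize the likelihood of expert tool-use steps (cross-entropy)."""
    inputs = tokenize(images=batch["images"], prompts=batch["prompts"])
    labels = tokenize(text=batch["target_steps"])["input_ids"]
    outputs = vlm(**inputs, labels=labels)  # assumes an HF-style forward pass
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```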
Tajamul Ashraf
IIT Delhi, MBZUAI
Computer Vision, Deep Learning
Umair Nawaz
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), United Arab Emirates
Abdelrahman M. Shaker
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), United Arab Emirates
R. Anwer
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), United Arab Emirates
Philip Torr
Professor, University of Oxford
Department of Engineering
Fahad Shahbaz Khan
MBZUAI, Linköping University, Sweden
Computer Vision, Object Recognition, Generative AI, AI for Science
Salman Khan
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), United Arab Emirates