Iterative Trajectory Exploration for Multimodal Agents

📅 2025-04-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal agents heavily rely on extensive expert-annotated data for fine-tuning in novel environments, severely limiting deployment flexibility and cross-environment generalization. To address this, we propose SPORT, the first step-level preference-driven self-exploration framework: a large language model autonomously generates diverse tasks; an AI validator enables online sampling and verification of execution steps; and step-level preference data—derived from verified trajectories—is used to iteratively refine the controller policy. This establishes a closed-loop paradigm of “task self-generation → solution self-validation → policy self-improvement”, eliminating dependence on human annotations entirely. On the GTA and GAIA benchmarks, SPORT achieves absolute improvements of +6.41% and +3.64%, respectively, demonstrating significantly enhanced generalization across unseen environments and robustness in complex task solving.

📝 Abstract
Multimodal agents, which integrate a controller (e.g., a large language model) with external tools, have demonstrated remarkable capabilities in tackling complex tasks. However, existing agents need to collect large amounts of expert data for fine-tuning to adapt to new environments. In this paper, we propose an online self-exploration method for multimodal agents, namely SPORT, that uses step-wise preference optimization to refine agent trajectories: it automatically generates tasks and learns from solving them, without any expert annotation. SPORT operates through four iterative components: task synthesis, step sampling, step verification, and preference tuning. First, we synthesize multimodal tasks using language models. Then, we introduce a novel search scheme in which step sampling and step verification are executed alternately to solve each generated task. We employ a verifier to provide AI feedback and construct step-wise preference data. The data is subsequently used to update the controller's policy through preference tuning, producing a SPORT Agent. By interacting with real environments, the SPORT Agent evolves into a more refined and capable system. Evaluations on the GTA and GAIA benchmarks show that the SPORT Agent achieves 6.41% and 3.64% improvements, respectively, underscoring the generalization and effectiveness introduced by our method. The project page is https://SPORT-Agents.github.io.
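The four-component loop described in the abstract can be sketched in miniature. This is a hedged illustration, not the paper's implementation: every function body (`synthesize_tasks`, `sample_steps`, `verify`, `preference_tune`) is a hypothetical stand-in for what the real system does with an LLM controller, external tools, and an AI verifier.

```python
# Minimal sketch of one SPORT-style iteration: task synthesis ->
# step sampling -> step verification -> preference tuning.
# All bodies below are illustrative stand-ins, not the paper's models.

def synthesize_tasks(n):
    """Task synthesis: an LLM would generate n diverse multimodal tasks."""
    return [f"task-{i}" for i in range(n)]

def sample_steps(task, policy, k=2):
    """Step sampling: draw k candidate next steps from the controller."""
    return [f"{task}/step-v{policy}-{j}" for j in range(k)]

def verify(steps):
    """Step verification: an AI verifier scores candidates; here we
    deterministically prefer the lexicographically first candidate."""
    ranked = sorted(steps)
    return ranked[0], ranked[-1]  # (chosen, rejected)

def preference_tune(policy, pairs):
    """Preference tuning: update the controller on step-wise
    (chosen, rejected) pairs; here we just bump a version counter."""
    return policy + 1 if pairs else policy

def sport_iteration(policy, n_tasks=3):
    """One closed-loop pass over freshly synthesized tasks."""
    pairs = []
    for task in synthesize_tasks(n_tasks):
        chosen, rejected = verify(sample_steps(task, policy))
        pairs.append((chosen, rejected))
    return preference_tune(policy, pairs), pairs

policy = 0
for _ in range(4):  # four rounds of self-generation -> verification -> tuning
    policy, pairs = sport_iteration(policy)
print(policy)  # prints 4
```

In the actual method, the preference pairs are constructed at the step level rather than per trajectory, and the update is a preference-optimization step on the controller's parameters rather than a version bump.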
Problem

Research questions and friction points this paper is trying to address.

Online self-exploration for multimodal agents without expert data
Step-wise preference optimization to refine agent trajectories
Automated task generation and learning for improved agent performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online self-exploration method SPORT
Step-wise preference optimization for trajectories
Automatic task generation and learning
👥 Authors
Pengxiang Li — Beijing Institute of Technology (Multimodal Agent, Vision and Language, 3DV, Hyperbolic Learning)
Zhi Gao — State Key Laboratory of General Artificial Intelligence, BIGAI; School of Intelligence Science and Technology, Peking University
Bofei Zhang — BIGAI
Yapeng Mi — State Key Laboratory of General Artificial Intelligence, BIGAI; Harbin Institute of Technology
Xiaojian Ma — University of California, Los Angeles (Computer Vision, Machine Learning, Generative Modeling, Reinforcement Learning)
Chenrui Shi — Beijing Institute of Technology (Anomaly Detection)
Tao Yuan — University of California, Los Angeles (Computer Vision, Artificial Intelligence)
Yuwei Wu — Ph.D. candidate, GRASP Lab, University of Pennsylvania (Robotics, Trajectory Optimization, Task and Motion Planning)
Yunde Jia — Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University
Song-Chun Zhu — State Key Laboratory of General Artificial Intelligence, BIGAI; Department of Automation, Tsinghua University
Qing Li — State Key Laboratory of General Artificial Intelligence, BIGAI