🤖 AI Summary
Multimodal agents heavily rely on extensive expert-annotated data for fine-tuning in novel environments, severely limiting deployment flexibility and cross-environment generalization. To address this, we propose SPORT, the first step-level preference-driven self-exploration framework: a large language model autonomously generates diverse tasks; an AI validator enables online sampling and verification of execution steps; and step-level preference data—derived from verified trajectories—is used to iteratively refine the controller policy. This establishes a closed-loop paradigm of “task self-generation → solution self-validation → policy self-improvement”, eliminating dependence on human annotations entirely. On the GTA and GAIA benchmarks, SPORT achieves absolute improvements of +6.41% and +3.64%, respectively, demonstrating significantly enhanced generalization across unseen environments and robustness in complex task solving.
📝 Abstract
Multimodal agents, which integrate a controller (e.g., a large language model) with external tools, have demonstrated remarkable capabilities in tackling complex tasks. However, existing agents need to collect a large number of expert data for fine-tuning to adapt to new environments. In this paper, we propose an online self-exploration method for multimodal agents, namely SPORT, via step-wise preference optimization to refine the trajectories of agents, which automatically generates tasks and learns from solving the generated tasks, without any expert annotation. SPORT operates through four iterative components: task synthesis, step sampling, step verification, and preference tuning. First, we synthesize multi-modal tasks using language models. Then, we introduce a novel search scheme, where step sampling and step verification are executed alternately to solve each generated task. We employ a verifier to provide AI feedback to construct step-wise preference data. The data is subsequently used to update the controller's policy through preference tuning, producing a SPORT Agent. By interacting with real environments, the SPORT Agent evolves into a more refined and capable system. Evaluation in the GTA and GAIA benchmarks show that the SPORT Agent achieves 6.41% and 3.64% improvements, underscoring the generalization and effectiveness introduced by our method. The project page is https://SPORT-Agents.github.io.