🤖 AI Summary
This work addresses a training bottleneck for end-to-end GUI agents in real-world desktop environments: the scarcity of high-quality interaction data. The authors propose Anchor, a framework that combines a state-change-based branching mechanism with a task-conditioned filtering strategy. Starting from a small set of seed trajectories, the method automatically generates high-fidelity, diverse, and goal-consistent GUI interaction data through four stages: state-aware trajectory expansion, step-level task-conditioned filtering, post-branch denoising, and consistency validation. On the OSWorld and WindowsAgentArena benchmarks, models fine-tuned on the generated data consistently outperform zero-shot agents and existing synthetic data approaches, and they generalize across applications and operating systems.
📝 Abstract
End-to-end GUI agents for real desktop environments require large amounts of high-quality interaction data, yet collecting human demonstrations is expensive, and existing synthetic pipelines often suffer from limited task diversity or from noisy, goal-drifting trajectories. We present Anchor, a trajectory expansion framework that bootstraps scalable desktop supervision from a small set of verified seed demonstrations. Starting from each seed, we identify branch points that correspond to meaningful state changes and propose new, state-grounded task variants conditioned on the current GUI context. An executing agent then follows the proposed instructions to generate new trajectories, while a verifier enforces task completion via state-aware checks and trajectory-level consistency. To improve supervision quality, we further apply task-conditioned step-level filtering to remove ungrounded actions, and we denoise post-branch segments to maintain coherent intent. Experiments on two standard desktop benchmarks, OSWorld and WindowsAgentArena, show that models fine-tuned on our expanded corpus achieve consistent improvements over zero-shot agents and representative synthesis baselines, and generalize across applications and operating systems.
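The expansion loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: every name here (`Step`, `find_branch_points`, `filter_steps`, `expand`) and the four callables passed to `expand` are hypothetical. The sketch also assumes that the GUI state is a flat dictionary and that any difference between consecutive states counts as a "meaningful" change, which is almost certainly cruder than the criterion Anchor uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical flattened GUI state: element identifier -> observed value.
State = Dict[str, str]


@dataclass
class Step:
    action: str          # e.g. "click(Save)"; the action format is illustrative
    state_after: State   # GUI state observed after the action


def find_branch_points(initial: State, steps: List[Step]) -> List[int]:
    """Return indices of seed steps whose action changed the observed state."""
    points, prev = [], initial
    for i, step in enumerate(steps):
        if step.state_after != prev:  # assumption: any difference is meaningful
            points.append(i)
        prev = step.state_after
    return points


def filter_steps(task: str, steps: List[Step],
                 is_grounded: Callable[[str, Step], bool]) -> List[Step]:
    """Keep only the steps that an assumed grounding check accepts for the task."""
    return [s for s in steps if is_grounded(task, s)]


def expand(initial: State,
           seed: List[Step],
           propose: Callable[[State], List[str]],
           execute: Callable[[State, str], List[Step]],
           verify: Callable[[str, State], bool],
           is_grounded: Callable[[str, Step], bool]) -> List[Tuple[str, List[Step]]]:
    """Branch from each state change in the seed and keep verified trajectories."""
    expanded = []
    for i in find_branch_points(initial, seed):
        prefix = seed[: i + 1]
        state = seed[i].state_after
        for task in propose(state):          # state-grounded task variants
            rollout = execute(state, task)   # executing agent follows the task
            if rollout and verify(task, rollout[-1].state_after):
                cleaned = filter_steps(task, rollout, is_grounded)
                expanded.append((task, prefix + cleaned))
    return expanded
```

In this sketch, `propose`, `execute`, `verify`, and `is_grounded` stand in for the task proposer, the executing agent, the verifier, and the step-level filter. In practice each would be a model or an environment-backed check rather than a plain function.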