The Unreasonable Effectiveness of Scaling Agents for Computer Use

📅 2025-10-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Computer-use agents (CUAs) are unreliable and show high variance on long-horizon, complex digital tasks. This paper proposes Behavior Best-of-N (bBoN), a method that scales over agents: it generates multiple rollouts, describes each rollout with a behavior narrative, and uses those narratives to select among the trajectories. bBoN improves robustness and success rates, reaching a 69.9% task success rate on OSWorld, a new state of the art that approaches human-level performance (72%), and it generalizes to other operating systems on WindowsAgentArena and AndroidWorld. The core message is that scaling CUAs is highly effective when it is paired with structured trajectory understanding and selection, and bBoN offers a practical framework for doing so.

📝 Abstract

Computer-use agents (CUAs) hold promise for automating everyday digital tasks, but their unreliability and high variance hinder their application to long-horizon, complex tasks. We introduce Behavior Best-of-N (bBoN), a method that scales over agents by generating multiple rollouts and selecting among them using behavior narratives that describe the agents' rollouts. It enables both wide exploration and principled trajectory selection, substantially improving robustness and success rates. On OSWorld, our bBoN scaling method establishes a new state of the art (SoTA) at 69.9%, significantly outperforming prior methods and approaching human-level performance at 72%, with comprehensive ablations validating key design choices. We further demonstrate strong generalization results to different operating systems on WindowsAgentArena and AndroidWorld. Crucially, our results highlight the unreasonable effectiveness of scaling CUAs, when you do it right: effective scaling requires structured trajectory understanding and selection, and bBoN provides a practical framework to achieve this.
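The selection idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Rollout`, `summarize`, `behavior_best_of_n`, and `keyword_judge` are hypothetical, the agents are toy functions that return fixed action lists, and the keyword-overlap judge is only a stand-in for whatever selection model the paper actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Rollout:
    agent_id: int
    actions: List[str]  # raw actions the agent took
    narrative: str      # text description of what the agent did


def summarize(actions: Sequence[str]) -> str:
    """Hypothetical narrative generator: simply lists the steps in order."""
    return "; ".join(f"step {i + 1}: {a}" for i, a in enumerate(actions))


def behavior_best_of_n(
    task: str,
    agents: Sequence[Callable[[str], List[str]]],
    judge: Callable[[str, str], float],
) -> Rollout:
    """Run every agent once, then keep the rollout whose narrative scores highest."""
    rollouts = []
    for agent_id, agent in enumerate(agents):
        actions = agent(task)
        rollouts.append(Rollout(agent_id, actions, summarize(actions)))
    # Selection looks only at the narratives, not at the raw trajectories.
    return max(rollouts, key=lambda r: judge(task, r.narrative))


def keyword_judge(task: str, narrative: str) -> float:
    """Toy stand-in judge (an assumption): counts task words found in the narrative."""
    return float(sum(word in narrative for word in task.split()))


# Two toy agents with fixed behavior; the second one completes the task.
agents = [
    lambda task: ["open report", "close window"],
    lambda task: ["open report", "click export", "choose pdf", "save"],
]

best = behavior_best_of_n("save the report as pdf", agents, keyword_judge)
print(best.agent_id, "|", best.narrative)
```

Keeping the judge as a plain callable means the toy scorer can be swapped for a stronger evaluator without changing the rollout or selection loop.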

Problem

Research questions and friction points this paper is trying to address.

Improving reliability of computer-use agents for complex tasks

Scaling agents through behavior-based trajectory selection

Achieving human-level performance in automated digital tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Scaling agents by generating multiple rollouts

Selecting trajectories using behavior narratives

Achieving state-of-the-art performance across operating systems

🔎 Similar Papers

Scaling Large-Language-Model-based Multi-Agent Collaboration