🤖 AI Summary
This work addresses the challenge of continual learning for agents in dynamic digital environments, where distribution shifts and unfamiliar scenarios often hinder adaptation, compounded by a scarcity of high-quality, environment-relevant unlabeled data. To overcome this, the authors propose ACuRL, a novel autonomous curriculum reinforcement learning framework that operates without any human-provided data. ACuRL initiates learning through self-exploration, then iteratively synthesizes new tasks tailored to the agent's current capabilities via a curriculum task generator informed by historical feedback. A key component is CUAJudge, a highly consistent automated evaluator that achieves 93% agreement with human judgments. The framework enables both intra- and cross-environment continual learning while mitigating catastrophic forgetting, yielding performance gains of 4%–22% with updates to only approximately 20% of the model parameters.
📝 Abstract
Real-world digital environments are highly diverse and dynamic. These characteristics cause agents to frequently encounter unseen scenarios and distribution shifts, making continual learning in specific environments essential for computer-use agents (CUAs). However, a key challenge lies in obtaining high-quality and environment-grounded agent data without relying on costly human annotation. In this work, we introduce ACuRL, an Autonomous Curriculum Reinforcement Learning framework that continually adapts agents to specific environments with zero human data. The agent first explores target environments to acquire initial experiences. During subsequent iterative training, a curriculum task generator leverages these experiences together with feedback from the previous iteration to synthesize new tasks tailored to the agent's current capabilities. To provide reliable reward signals, we introduce CUAJudge, a robust automatic evaluator for CUAs that achieves 93% agreement with human judgments. Empirically, our method effectively enables both intra-environment and cross-environment continual learning, yielding 4–22% performance gains without catastrophic forgetting on existing environments. Further analyses show highly sparse updates (e.g., ~20% of parameters), which helps explain the effective and robust adaptation. Our data and code are available at https://github.com/OSU-NLP-Group/ACuRL.
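The training loop described above (self-exploration, then iterative task synthesis with judge-based rewards) can be sketched as follows. This is a minimal illustrative skeleton, not the authors' implementation: all function names (`explore`, `generate_tasks`, `cua_judge`, `acurl_loop`) and the stubbed bodies are hypothetical placeholders for the corresponding components of ACuRL.

```python
import random


def explore(env: str, num_episodes: int) -> list[str]:
    """Self-exploration phase: collect initial experiences from the
    target environment (stubbed as string identifiers here)."""
    return [f"{env}-experience-{i}" for i in range(num_episodes)]


def generate_tasks(experiences: list[str], feedback: list[dict], n: int) -> list[dict]:
    """Curriculum task generator: synthesize new tasks conditioned on
    past experiences and the previous iteration's feedback. Here the
    'difficulty' is a toy proxy that rises with the recent success rate."""
    success_rate = (
        sum(f["success"] for f in feedback) / len(feedback) if feedback else 0.0
    )
    return [
        {"desc": random.choice(experiences), "difficulty": 1.0 + success_rate}
        for _ in range(n)
    ]


def cua_judge(task: dict, trajectory: str) -> bool:
    """Stub standing in for CUAJudge, the automatic evaluator that
    scores a rollout and supplies the reward signal."""
    return random.random() < 0.5


def acurl_loop(env: str, iterations: int = 3, tasks_per_iter: int = 4) -> list[dict]:
    """One environment's adaptation loop: explore once, then iterate
    task generation -> rollout -> judging -> (RL update, omitted)."""
    experiences = explore(env, num_episodes=5)
    feedback: list[dict] = []
    for _ in range(iterations):
        tasks = generate_tasks(experiences, feedback, tasks_per_iter)
        feedback = []
        for task in tasks:
            trajectory = f"rollout-of-{task['desc']}"  # agent attempts the task
            reward = cua_judge(task, trajectory)       # judge-provided reward
            feedback.append({"task": task, "success": reward})
        # An RL update on the (trajectory, reward) pairs would go here.
    return feedback
```

Each iteration's feedback feeds the next round of task generation, which is how the curriculum tracks the agent's current capability in this sketch.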