🤖 AI Summary
Existing language agents rely heavily on expert demonstrations for supervised fine-tuning, which limits scalability and generalization. To address this, the authors propose the "early experience" learning paradigm: the agent acts in the environment itself, and the future states resulting from its own actions serve as supervision, requiring neither reward signals nor expert data. The paradigm is instantiated through two strategies: implicit world modeling, which grounds the policy in environment dynamics, and self-reflection, in which the agent learns from its own suboptimal actions to improve reasoning and decision-making. Across eight diverse environments and multiple model families, both strategies consistently improve task performance and out-of-domain generalization. Moreover, in environments with verifiable rewards, early experience provides a strong foundation for subsequent reinforcement learning, positioning it as a practical bridge between imitation learning and fully experience-driven, autonomously improving language agents.
📝 Abstract
A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent's own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm we study two strategies for using such data: (1) Implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) Self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. We evaluate across eight diverse environments and multiple model families. Our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, positioning it as a practical bridge between imitation learning and fully experience-driven agents.
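To make the paradigm concrete, here is a minimal sketch of how reward-free training pairs might be constructed from the agent's own rollouts under the two strategies. All function names, prompt formats, and data fields are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Illustrative sketch (assumed names/formats, not the paper's API):
# early-experience rollouts are (state, action, next_state) triples collected
# by the agent itself; no rewards or expert labels are involved.

def world_modeling_pairs(rollout):
    """Implicit world modeling: supervise the model to predict the
    observed next state given the current state and the action taken."""
    return [
        {"input": f"State: {s}\nAction: {a}\nPredict the next state.",
         "target": s_next}
        for (s, a, s_next) in rollout
    ]

def self_reflection_pairs(rollout, alternatives):
    """Self-reflection: contrast the taken action's outcome with the
    outcomes of alternative actions, prompting the agent to generate a
    rationale for which action is preferable."""
    pairs = []
    for (s, a, s_next), alts in zip(rollout, alternatives):
        contrast = "\n".join(
            f"Alternative action: {aa} -> outcome: {oo}" for aa, oo in alts
        )
        pairs.append({
            "input": (f"State: {s}\nChosen action: {a} -> outcome: {s_next}\n"
                      f"{contrast}\nReflect on which action is preferable and why."),
            # the target would be the agent's own generated reflection text
            "target": None,
        })
    return pairs

# Toy rollout from a hypothetical web environment
rollout = [("product page", "click 'buy'", "checkout page")]
alternatives = [[("click 'help'", "help page")]]

wm = world_modeling_pairs(rollout)
sr = self_reflection_pairs(rollout, alternatives)
print(wm[0]["target"])  # the observed next state serves as supervision
```

Note how neither objective needs a reward: the first is supervised by observed environment dynamics, the second by the agent's own comparison of outcomes.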