OpenClaw-RL: Train Any Agent Simply by Talking

📅 2026-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing reinforcement learning systems for intelligent agents fail to leverage the next-state signals (user replies, tool outputs, interface changes) generated during interaction as a source of online learning. This work proposes a unified framework that, for the first time, models such next-state signals in multimodal interactions as both evaluative and directive feedback. It introduces an asynchronous, coordination-free online learning architecture that enables any agent to continuously self-improve through natural usage. The approach combines a process reward model (PRM) judge, which extracts scalar rewards, with Hindsight-Guided On-Policy Distillation, which recovers textual hints from next-state observations to construct an augmented teacher context providing token-level advantage supervision. Experiments demonstrate consistent performance gains in both personal and general agent settings using only user interactions, and validate the efficacy of process rewards across terminal, GUI, software engineering, and tool-calling tasks.
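
As a rough illustration of the evaluative path described above, the sketch below shows how a PRM judge might turn one (state, action, next-state) transition into a scalar reward. All names here (`Transition`, `score_transition`, `judge_llm`) are hypothetical stand-ins for illustration, not the OpenClaw-RL API.

```python
from dataclasses import dataclass

@dataclass
class Transition:
    state: str       # context the agent saw: conversation, terminal, GUI dump
    action: str      # the message or action the agent emitted
    next_state: str  # what followed: user reply, tool output, state change

def score_transition(judge_llm, t: Transition) -> float:
    """Ask a PRM judge model for a scalar reward in [0, 1] for one step."""
    prompt = (
        "Given what happened next, rate how well this action advanced "
        "the task. Answer with one number between 0 and 1.\n"
        f"State: {t.state}\nAction: {t.action}\nNext state: {t.next_state}"
    )
    reply = judge_llm(prompt)  # any text-in/text-out LLM callable
    try:
        return min(max(float(reply.strip()), 0.0), 1.0)
    except ValueError:
        return 0.5  # unparseable judgment: fall back to a neutral reward
```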

📝 Abstract
Every agent interaction generates a next-state signal, namely the user reply, tool output, terminal or GUI state change that follows each action, yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework built on a simple observation: next-state signals are universal, and policy can learn from all of them simultaneously. Personal conversations, terminal executions, GUI interactions, SWE tasks, and tool-call traces are not separate training problems. They are all interactions that can be used to train the same policy in the same loop. Next-state signals encode two forms of information: evaluative signals, which indicate how well the action performed and are extracted as scalar rewards via a PRM judge; and directive signals, which indicate how the action should have been different and are recovered through Hindsight-Guided On-Policy Distillation (OPD). We extract textual hints from the next state, construct an enhanced teacher context, and provide token-level directional advantage supervision that is richer than any scalar reward. Due to the asynchronous design, the model serves live requests, the PRM judges ongoing interactions, and the trainer updates the policy at the same time, with zero coordination overhead between them. Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, the same infrastructure supports scalable RL across terminal, GUI, SWE, and tool-call settings, where we additionally demonstrate the utility of process rewards. Code: https://github.com/Gen-Verse/OpenClaw-RL
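
One plausible reading of the directive path, sketched below under the assumption of Hugging Face-style causal LMs: a textual hint recovered in hindsight from the next state is inserted into the teacher's context, and the teacher/student log-probability gap on the sampled action tokens serves as a token-level advantage. Function and variable names are illustrative assumptions, not the released code.

```python
import torch
import torch.nn.functional as F

def token_level_advantages(student, teacher, context_ids, hint_ids, action_ids):
    """Per-token directional advantage for the sampled action tokens.

    student/teacher: causal LMs returning logits of shape [B, T, V].
    context_ids: the on-policy context the student actually saw.
    hint_ids: textual hint extracted in hindsight from the next state.
    action_ids: tokens of the action the student sampled.
    """
    A = action_ids.size(1)

    # Student scores the action under the original, unaugmented context.
    s_in = torch.cat([context_ids, action_ids], dim=1)
    s_logits = student(s_in).logits[:, -A - 1:-1]
    s_logp = F.log_softmax(s_logits, -1).gather(
        -1, action_ids.unsqueeze(-1)).squeeze(-1)

    # Teacher scores the same action under the hint-augmented context.
    t_in = torch.cat([context_ids, hint_ids, action_ids], dim=1)
    with torch.no_grad():
        t_logits = teacher(t_in).logits[:, -A - 1:-1]
    t_logp = F.log_softmax(t_logits, -1).gather(
        -1, action_ids.unsqueeze(-1)).squeeze(-1)

    # Positive where the hindsight-informed teacher prefers a token more
    # than the student did, steering each token in a specific direction.
    return t_logp - s_logp
```

In the asynchronous loop the abstract describes, advantages like these would be computed by the trainer while the serving model and the PRM judge run independently, with no coordination between the three.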
Problem

Research questions and friction points this paper is trying to address.

next-state signals
agentic reinforcement learning
online learning
universal interaction
policy training
Innovation

Methods, ideas, or system contributions that make the work stand out.

next-state signals
process reward model (PRM)
Hindsight-Guided On-Policy Distillation (OPD)
asynchronous reinforcement learning
universal agent training
🔎 Similar Papers
No similar papers found.