ECHO: Terminal Agents Learn World Models for Free

📅 2026-05-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Standard reinforcement learning for terminal agents, such as GRPO-style training, relies solely on sparse outcome-level rewards and learns little from failed trajectories. The authors propose ECHO (Environment Cross-entropy Hybrid Objective), which augments the policy-gradient loss on action tokens with an auxiliary loss that trains the policy to predict the environment observation tokens produced by its own actions. This treats terminal feedback as a supervision signal and yields dense supervision for every rollout without expert demonstrations or additional rollouts, since ECHO reuses the same forward pass as GRPO. On TerminalBench-2.0, ECHO raises pass@1 for Qwen3-8B from 2.70% to 5.17% and for Qwen3-14B from 5.17% to 10.79%. It also sharply reduces environment-token cross-entropy on held-out rollouts, whereas GRPO alone barely changes it. Starting from base Qwen3-8B, ECHO matches expert-SFT-then-GRPO performance on held-out terminal tasks and recovers roughly half of the expert-SFT initialization benefit on TerminalBench-2.0. In some settings, the environment prediction loss alone enables verifier-free self-improvement on unseen OOD tasks.

📝 Abstract

CLI agents are the closest thing language models have to an embodied setting: the model emits commands, the terminal executes them, and the returned stream -- stdout, errors, files, logs, and traces -- records the consequences. We argue that this stream is a supervision signal, but standard agent RL discards it: GRPO-style training updates action tokens with sparse outcome-level rewards while ignoring environment responses already in the rollout. Failed rollouts provide little policy-gradient signal despite containing rich evidence about how the environment responds. We introduce ECHO (Environment Cross-entropy Hybrid Objective), a hybrid objective that combines the standard policy-gradient loss on action tokens with an auxiliary loss that trains the policy to predict environment observation tokens resulting from its own actions. ECHO reuses the same forward pass as GRPO, requires no additional rollouts, and turns terminal feedback into dense supervision for all rollouts. ECHO doubles GRPO pass@1 on TerminalBench-2.0: Qwen3-8B improves from 2.70% to 5.17%, and Qwen3-14B from 5.17% to 10.79%. ECHO also produces policies that better predict terminal dynamics, even on trajectories they did not generate: across held-out rollouts, it sharply reduces environment-token cross-entropy while GRPO alone barely changes it. From base Qwen3-8B, ECHO matches expert-SFT-then-GRPO performance on held-out terminal tasks without expert demonstrations, and recovers roughly half of the expert-SFT initialization benefit on TerminalBench-2.0. In some settings, the environment prediction loss alone enables verifier-free self-improvement, allowing policies to improve on unseen OOD tasks by learning only from environment interactions. Together, these results suggest that environment observations are not merely context for future actions, but a dense, on-policy supervision signal already present in every rollout.
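
The hybrid objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a simplified, unclipped policy-gradient surrogate in place of the full GRPO loss, and the names `group_advantages`, `hybrid_loss`, and the mixing coefficient `env_weight` are hypothetical, since the abstract does not say how the two terms are weighted. It only shows which tokens feed which term, and why a group of failed rollouts still produces a learning signal.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-normalized advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def hybrid_loss(token_logps, is_action, advantage, env_weight=1.0):
    """Hypothetical per-rollout hybrid loss (illustrative sketch only).

    token_logps: log-probability the policy assigns to each rollout token
    is_action:   True for tokens the policy emitted, False for terminal output
    advantage:   scalar advantage shared by the whole rollout
    env_weight:  assumed mixing coefficient (not given in the source)
    """
    action_lps = [lp for lp, a in zip(token_logps, is_action) if a]
    env_lps = [lp for lp, a in zip(token_logps, is_action) if not a]
    # Simplified policy-gradient term on action tokens (no clipping or ratio).
    pg = -advantage * mean(action_lps) if action_lps else 0.0
    # Cross-entropy term on environment observation tokens.
    env_ce = -mean(env_lps) if env_lps else 0.0
    return pg + env_weight * env_ce, pg, env_ce

# A group in which every rollout failed: all advantages are zero, so the
# policy-gradient term vanishes, but the environment term does not.
advs = group_advantages([0.0, 0.0, 0.0, 0.0])
total, pg, env_ce = hybrid_loss(
    token_logps=[-0.5, -2.0, -1.0, -3.0],
    is_action=[True, False, True, False],
    advantage=advs[0],
)
```

In this toy group the policy-gradient term is zero while the environment cross-entropy term is positive, which mirrors the abstract's point that failed rollouts carry little policy-gradient signal yet still contain evidence about how the environment responds.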

Problem

Research questions and friction points this paper is trying to address.

CLI agents

world models

environment feedback

sparse rewards

terminal interactions

Innovation

Methods, ideas, or system contributions that make the work stand out.

world model

terminal agent

dense supervision