Co-Evolution of Policy and Internal Reward for Language Agents

📅 2026-04-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge that large language model agents often struggle with sparse and delayed environmental rewards in long-horizon tasks, leaving them with insufficient learning signals. To overcome this limitation, the authors propose a Self-Guide mechanism that generates internal self-guidance signals during inference to inform action selection and, during training, transforms these same signals into dense step-level internal rewards. These rewards are jointly optimized with the policy via Group Relative Policy Optimization (GRPO). The approach is the first to enable co-evolution of internal rewards and policy in language agents, unifying inference-time guidance with training-time supervision. Experiments show that self-guidance at inference alone yields substantial performance gains; when combined with GRPO-based joint training, the method outperforms pure environment-reward baselines by an average of 8% across three benchmarks.
📝 Abstract
Large language model (LLM) agents learn by interacting with environments, but long-horizon training remains fundamentally bottlenecked by sparse and delayed rewards. Existing methods typically address this challenge through post-hoc credit assignment or external reward models, which provide limited guidance at inference time and often separate reward improvement from policy improvement. We propose Self-Guide, a self-generated internal reward for language agents that supports both inference-time guidance and training-time supervision. Specifically, the agent uses Self-Guide as a short self-guidance signal to steer the next action during inference, and converts the same signal into step-level internal reward for denser policy optimization during training. This creates a co-evolving loop: better policy produces better guidance, and better guidance further improves policy as internal reward. Across three agent benchmarks, inference-time self-guidance already yields clear gains, while jointly evolving policy and internal reward with GRPO brings further improvements (8%) over baselines trained solely with environment reward. Overall, our results suggest that language agents can improve not only by collecting more experience, but also by learning to generate and refine their own internal reward during acting and learning.
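The loop described in the abstract — generate a short guidance signal before each action, reuse it as a dense step-level reward, then optimize with a group-relative baseline — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: `self_guide`, `act`, and `internal_reward` are hypothetical stubs standing in for LLM calls, and the way the sparse environment reward is mixed with the per-step internal reward is an assumption.

```python
import statistics

def self_guide(history):
    # Stub for the short self-guidance signal (an LLM call in practice).
    return f"focus on step {len(history) + 1}"

def act(history, guidance):
    # Stub policy step, conditioned on the guidance text at inference time.
    return f"do: {guidance}"

def internal_reward(guidance, action):
    # Stub dense step-level reward: did the action follow the guidance?
    return 1.0 if guidance in action else 0.0

def rollout(num_steps=3):
    # One trajectory: guidance steers each action, and the same signals
    # become per-step internal rewards for training.
    history, step_rewards = [], []
    for _ in range(num_steps):
        g = self_guide(history)
        a = act(history, g)
        history.append((g, a))
        step_rewards.append(internal_reward(g, a))
    env_reward = 1.0  # sparse, delayed environment reward (stubbed)
    # Assumed mixing scheme: sparse env reward + mean dense internal reward.
    return env_reward + sum(step_rewards) / num_steps

def grpo_advantages(group_rewards):
    # GRPO-style baseline: normalize each trajectory's reward against
    # the mean/std of its sampled group (no learned critic).
    mu = statistics.mean(group_rewards)
    sigma = statistics.pstdev(group_rewards) or 1.0
    return [(r - mu) / sigma for r in group_rewards]

# Sample a group of trajectories and compute group-relative advantages,
# which would weight the policy-gradient update in actual training.
group = [rollout() for _ in range(4)]
advantages = grpo_advantages(group)
```

The group-relative normalization is what makes the co-evolution cheap: no separate value network is needed, so improvements in the guidance signal feed directly back into the advantage estimates.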
Problem

Research questions and friction points this paper is trying to address.

sparse reward
delayed reward
language agents
policy improvement
internal reward
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Guide
internal reward
co-evolution
language agents
GRPO