Milestone-Guided Policy Learning for Long-Horizon Language Agents

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two obstacles to training long-horizon language agents with reinforcement learning, credit misattribution and sample inefficiency, by introducing the BEACON framework. Exploiting the compositional structure of tasks, BEACON partitions trajectories at milestone boundaries, applies temporal reward shaping within each segment, and estimates advantages at dual scales, so that a distant failure does not corrupt the evaluation of local actions. BEACON outperforms GRPO and GiGPO on the ALFWorld, WebShop, and ScienceWorld benchmarks. On long-horizon ALFWorld tasks it achieves a 92.9% success rate, against 53.5% for GRPO, and improves effective sample utilization from 23.7% to 82.0%.
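The sample-inefficiency problem can be illustrated with a minimal sketch of group-relative advantage normalization in the style of GRPO. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration, not details taken from the paper. When every rollout in a group fails, all rewards are equal, every advantage is zero, and the group contributes no gradient.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each rollout's reward against its own group (illustrative)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group in which every rollout failed: identical rewards, no learning signal.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# A group with one success: the advantages are non-zero and differ in sign.
mixed = group_relative_advantages([0.0, 0.0, 0.0, 1.0])
```

With sparse terminal rewards on long tasks, groups like `all_failed` become common, which is one plausible reading of why only a small share of samples ends up carrying signal.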

📝 Abstract

While long-horizon agentic tasks require language agents to perform dozens of sequential decisions, training such agents with reinforcement learning remains challenging. We identify two root causes: credit misattribution, where correct early actions are penalized due to terminal failures, and sample inefficiency, where scarce successful trajectories result in near-total loss of learning signal. We introduce a milestone-guided policy learning framework, BEACON, that leverages the compositional structure of long-horizon tasks to ensure precise credit assignment. BEACON partitions trajectories at milestone boundaries, applies temporal reward shaping within segments to credit partial progress, and estimates advantages at dual scales to prevent distant failures from corrupting the evaluation of local actions. On ALFWorld, WebShop, and ScienceWorld, BEACON consistently outperforms GRPO and GiGPO. Notably, on long-horizon ALFWorld tasks, BEACON achieves a 92.9% success rate, nearly doubling GRPO's 53.5%, while improving effective sample utilization from 23.7% to 82.0%. These results establish milestone-anchored credit assignment as an effective paradigm for training long-horizon language agents. Code is available at https://github.com/ZJU-REAL/BEACON.
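The three steps named in the abstract (partitioning at milestones, shaping within segments, dual-scale advantages) can be sketched roughly as follows. This is not the BEACON implementation: the abstract gives no formulas, so the discounted segment score, the mixing weight `alpha`, the discount `gamma`, and every function name below are hypothetical choices made only to show the general idea.

```python
from statistics import mean, pstdev

def split_at_milestones(num_steps, milestone_steps):
    """Return (start, end, reached) triples; a segment ends at a milestone step."""
    segments, start = [], 0
    for m in sorted(milestone_steps):
        segments.append((start, m + 1, True))
        start = m + 1
    if start < num_steps:
        # Trailing steps that never reached another milestone.
        segments.append((start, num_steps, False))
    return segments

def segment_scores(num_steps, milestone_steps, gamma=0.9):
    """Assumed shaping: gamma**(length - 1) if the milestone was reached, else 0."""
    return [gamma ** (end - start - 1) if reached else 0.0
            for start, end, reached in split_at_milestones(num_steps, milestone_steps)]

def dual_scale_advantages(group, alpha=0.5, gamma=0.9, eps=1e-8):
    """group: list of (num_steps, milestone_steps, success) rollouts of one task."""
    outcomes = [float(success) for _, _, success in group]
    mu, sigma = mean(outcomes), pstdev(outcomes)
    traj_adv = [(o - mu) / (sigma + eps) for o in outcomes]  # trajectory scale
    scores = [segment_scores(n, ms, gamma) for n, ms, _ in group]
    per_step = []
    for i, (n, ms, _) in enumerate(group):
        adv = [0.0] * n
        for k, (start, end, _) in enumerate(split_at_milestones(n, ms)):
            # Segment scale: compare the k-th segment across rollouts that have one.
            peers = [s[k] for s in scores if len(s) > k]
            local = (scores[i][k] - mean(peers)) / (pstdev(peers) + eps)
            for t in range(start, end):
                adv[t] = alpha * traj_adv[i] + (1 - alpha) * local
        per_step.append(adv)
    return per_step

# Four hypothetical 6-step rollouts: (steps, milestone step indices, task success).
group = [(6, [1, 4], True), (6, [1], False), (6, [], False), (6, [3], False)]
adv = dual_scale_advantages(group)
```

In this toy group, rollouts 1 and 2 both fail the task, so a trajectory-level advantage alone would penalize their first steps equally. With the segment-level term added, the first steps of rollout 1 (which reached a milestone quickly) receive a positive advantage, while those of rollout 2 (which reached none) stay negative.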

Problem

Research questions and friction points this paper is trying to address.

credit misattribution

sample inefficiency

long-horizon tasks

language agents

reinforcement learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

milestone-guided learning

credit assignment

long-horizon tasks