🤖 AI Summary
This work addresses the instability in large language model (LLM) reinforcement learning caused by inaccurate state value estimation. The authors introduce SVEB, a benchmark for state value estimation, which reveals that existing approaches often degrade the value network into a coarse-grained mean estimator. To mitigate this, they propose two novel methods: Numca leverages numerical ranges as measurable milestones to guide value prediction, while Hista performs weighted aggregation of disjoint trajectories and their returns based on LLM hidden states. Integrating these techniques with representation learning and standard RL algorithms such as PPO consistently improves both value estimation accuracy and training stability across diverse model scales and algorithmic configurations, with negligible additional computational overhead.
📝 Abstract
Reinforcement learning (RL) refines large language models (LLMs) by directly optimizing model behavior through reward signals. While accurate state value estimation is critical for stable training in classical RL, it remains an underexplored challenge in LLM post-training. In this work, we introduce the State Value Estimation Benchmark (SVEB) to assess state value estimation within existing RL frameworks, and we show that critics in standard approaches such as PPO collapse to a coarse group-average baseline. To address this, we propose two techniques: Numca, which leverages numerical spans as gradable milestones for state value estimation, and Hista, a framework that uses the LLM's hidden states as representations to compute a weighted average over disjoint rollouts and their returns. Extensive experiments demonstrate that both methods yield more accurate state value estimates and enhance training performance across different RL algorithms and model sizes without incurring significant computational overhead.
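The idea behind Hista, weighting the returns of other rollouts by hidden-state similarity, can be illustrated with a minimal sketch. The abstract does not specify the weighting scheme, so everything here is an assumption made for illustration: the function names `cosine` and `similarity_weighted_value`, the use of cosine similarity, the softmax weighting, and the `temperature` parameter are all hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_weighted_value(
    query_state: Sequence[float],
    ref_states: Sequence[Sequence[float]],
    ref_returns: Sequence[float],
    temperature: float = 0.1,
) -> float:
    """Estimate a state's value as a similarity-weighted average of returns.

    Hypothetical illustration only: the similarity measure and the softmax
    weighting are assumptions, not the paper's definition of Hista.
    """
    if not ref_states or len(ref_states) != len(ref_returns):
        raise ValueError("ref_states and ref_returns must be non-empty and aligned")
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Softmax over similarities, shifted by the max for numerical stability.
    logits = [cosine(query_state, h) / temperature for h in ref_states]
    max_logit = max(logits)
    weights = [math.exp(l - max_logit) for l in logits]
    total = sum(weights)

    return sum(w * r for w, r in zip(weights, ref_returns)) / total


# Toy data: two reference rollouts with opposite hidden states and returns.
states = [[1.0, 0.0], [-1.0, 0.0]]
returns = [1.0, 0.0]
sharp = similarity_weighted_value([1.0, 0.0], states, returns, temperature=0.1)
flat = similarity_weighted_value([1.0, 0.0], states, returns, temperature=1e6)
```

In this toy setup, a low temperature makes the estimate follow the return of the most similar rollout, while a very high temperature flattens the weights so the estimate approaches the plain mean of the returns, which is the kind of coarse group-average baseline the abstract describes critics collapsing to.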