🤖 AI Summary
PPO exhibits poor stability and frequent training collapse when fine-tuning large language models (LLMs) on multi-turn dialogue and reasoning tasks. This stems from two key issues: (1) token-level importance sampling is misaligned with the inherent turn-level decision structure of dialogues, and (2) advantage estimates on highly off-policy samples are unreliable, producing high-variance gradients and unstable updates. To address these, we propose a turn-level PPO framework in which entire dialogue turns, rather than individual tokens, serve as the fundamental units for importance sampling and policy updates, together with a clipping-bias correction that downweights unreliable, highly off-policy samples when normalizing gradients. Combining these components yields three variants: Turn-PPO (turn-level sampling only), S-PPO (clipping-bias correction applied to token-level PPO), and ST-PPO (both combined). Experiments across general QA, multi-hop QA, and medical multiple-choice QA tasks demonstrate that ST-PPO and S-PPO significantly outperform standard PPO, achieving lower clipping ratios, stable convergence throughout training, and improved scalability and robustness for multi-turn LLM agents.
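The core idea of turn-level importance sampling can be sketched in a few lines: token log-probabilities are aggregated per dialogue turn, giving one importance ratio per turn, and the standard PPO clipped surrogate is applied at that granularity. This is an illustrative sketch only; the function name, tensor layout, and hyperparameters are assumptions, and the paper's exact formulation may differ.

```python
import torch

def turn_level_ppo_loss(logp_new, logp_old, advantages, turn_ids, clip_eps=0.2):
    """Sketch of a turn-level PPO objective (hypothetical helper).

    logp_new, logp_old: per-token log-probs under the new/old policy, shape [T]
    advantages: one advantage estimate per turn, shape [num_turns]
    turn_ids: which turn each token belongs to, shape [T] (int64)
    """
    num_turns = int(turn_ids.max().item()) + 1
    # Sum token log-probs within each turn, so the importance ratio is
    # computed over the whole turn instead of per token.
    turn_logp_new = torch.zeros(num_turns).index_add_(0, turn_ids, logp_new)
    turn_logp_old = torch.zeros(num_turns).index_add_(0, turn_ids, logp_old)
    ratio = torch.exp(turn_logp_new - turn_logp_old)  # one ratio per turn
    # Standard PPO clipped surrogate, applied at turn granularity.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

When the new and old policies coincide, every turn ratio is 1 and the loss reduces to the negative mean advantage, as in standard PPO.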
📝 Abstract
PPO has been widely adopted for training large language models (LLMs) at the token level in multi-turn dialogue and reasoning tasks. However, its performance is often unstable and prone to collapse. Through empirical analysis, we identify two main sources of instability in this setting: (1) token-level importance sampling, which is misaligned with the natural granularity of multi-turn environments that have distinct turn-level stages, and (2) inaccurate advantage estimates from off-policy samples, where the critic has not learned to evaluate certain state-action pairs, resulting in high-variance gradients and unstable updates. To address these challenges, we introduce two complementary stabilization techniques: (1) turn-level importance sampling, which aligns optimization with the natural structure of multi-turn reasoning, and (2) clipping-bias correction, which normalizes gradients by downweighting unreliable, highly off-policy samples. Depending on how these components are combined, we obtain three variants: Turn-PPO (turn-level sampling only), S-PPO (clipping-bias correction applied to token-level PPO), and ST-PPO (turn-level sampling combined with clipping-bias correction). In our experiments, we primarily study ST-PPO and S-PPO, which together demonstrate how the two stabilization mechanisms address complementary sources of instability. Experiments on multi-turn search tasks across general QA, multi-hop QA, and medical multiple-choice QA benchmarks show that ST-PPO and S-PPO consistently prevent the performance collapses observed in large-model training, maintain lower clipping ratios throughout optimization, and achieve higher task performance than standard token-level PPO. These results demonstrate that combining turn-level importance sampling with clipping-bias correction provides a practical and scalable solution for stabilizing multi-turn LLM agent training.
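The second mechanism, clipping-bias correction, can likewise be sketched as token-level PPO in which tokens whose importance ratio is far outside a trust region are zeroed out and the loss is renormalized over the tokens that remain, so that dropped samples do not silently shrink the gradient scale. The threshold, function name, and renormalization scheme here are assumptions for illustration; the paper's actual correction may differ in detail.

```python
import torch

def clip_bias_corrected_loss(logp_new, logp_old, advantages,
                             clip_eps=0.2, offpolicy_thresh=2.0):
    """Sketch of token-level PPO with a clipping-bias correction
    (hypothetical helper). All tensors have shape [T] (one entry per token)."""
    ratio = torch.exp(logp_new - logp_old)
    # Drop tokens so far off-policy that the critic's advantage estimate
    # is considered unreliable (the threshold is an assumption).
    keep = ((ratio < offpolicy_thresh) &
            (ratio > 1.0 / offpolicy_thresh)).float()
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    per_token = -torch.min(unclipped, clipped) * keep
    # Renormalize by the retained-token count so masking does not
    # deflate the overall gradient magnitude.
    return per_token.sum() / keep.sum().clamp(min=1.0)
```

With identical policies no token is masked and the loss again reduces to the negative mean advantage; as samples drift off-policy, unreliable tokens are excluded rather than allowed to inject high-variance gradients.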