Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

πŸ“… 2026-04-09
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses critical challenges in on-policy distillation (OPD), where student models generating their own training data often suffer from output length inflation, trajectory truncation, and training instability. The study uncovers the intrinsic mechanisms linking length inflation to gradient bias. To mitigate these issues, the authors propose StableOPD, a framework that stabilizes training dynamics by constraining the student's generation behavior through divergence control relative to a reference model and by incorporating mixed-trajectory distillation. This approach suppresses repetition saturation and truncation-induced collapse, yielding an average performance gain of 7.2% across multiple mathematical reasoning benchmarks while improving training stability and generalization.
πŸ“ Abstract
On-policy distillation (OPD) trains student models under their own induced distribution while leveraging supervision from stronger teachers. We identify a failure mode of OPD: as training progresses, on-policy rollouts can undergo abrupt length inflation, causing truncated trajectories to dominate the training data. This truncation collapse coincides with abrupt repetition saturation and induces biased gradient signals, leading to severe training instability and sharp degradation in validation performance. We attribute this problem to the interaction between student-induced data collection and the distillation objective, which implicitly favors long and repetitive rollouts. To address this issue, we propose StableOPD, a stabilized OPD framework that combines a reference-based divergence constraint with rollout mixture distillation. These together mitigate repetition-induced length inflation and further stabilize OPD training. Across multiple math reasoning datasets, our approach prevents truncation collapse, stabilizes training dynamics, and improves performance by 7.2% on average.
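The abstract describes two ingredients: a reference-based divergence constraint and rollout mixture distillation. The sketch below is a hypothetical NumPy illustration of how such an objective might be shaped at a single token position, not the paper's actual implementation; the choice of forward KL for distillation, the penalty weight `beta`, and the mixing ratio `alpha` are all assumptions.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last (vocab) axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q):
    # KL(p || q) summed over the vocab axis; softmax never yields exact zeros.
    return float(np.sum(p * (np.log(p) - np.log(q))))

def stable_opd_loss(student_logits, teacher_logits, ref_logits, beta=0.1):
    """Hypothetical StableOPD-style token loss: match the teacher via a
    distillation KL, plus a reference-based divergence penalty that
    discourages the student policy from drifting toward the long,
    repetitive rollouts the paper identifies."""
    s = softmax(student_logits)
    t = softmax(teacher_logits)
    r = softmax(ref_logits)
    distill = kl(t, s)   # imitate the teacher (KL direction is an assumption)
    anchor = kl(s, r)    # stay close to the reference policy
    return distill + beta * anchor

def mix_rollouts(student_rollouts, off_policy_rollouts, alpha=0.5, rng=None):
    """Toy rollout mixture: with probability alpha, replace a student
    (on-policy) trajectory with an off-policy one, diluting the
    student-induced data that drives truncation collapse."""
    rng = rng or np.random.default_rng(0)
    return [off if rng.random() < alpha else on
            for on, off in zip(student_rollouts, off_policy_rollouts)]
```

When student, teacher, and reference distributions coincide, both KL terms vanish and the loss is zero; the anchor term grows as the student drifts from the reference, which is the mechanism the abstract credits with mitigating repetition-induced length inflation.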
Problem

Research questions and friction points this paper is trying to address.

On-policy distillation
length inflation
truncation collapse
training instability
repetition saturation
Innovation

Methods, ideas, or system contributions that make the work stand out.

On-policy Distillation
Length Inflation
StableOPD
Repetition Saturation
Rollout Mixture Distillation
πŸ”Ž Similar Papers
No similar papers found.