🤖 AI Summary
This work addresses two limitations of existing post-training methods: sparse verifier rewards give little fine-grained guidance about where reasoning succeeds or fails, and conventional on-policy distillation (OPD) treats each rollout independently, ignoring the other attempts sampled for the same prompt. The authors propose Multi-Rollout On-Policy Distillation (MOPD), a framework that conditions the teacher on both successful and failed trajectories generated by the student for the same prompt, yielding dense, instance-adaptive supervision grounded in peer experience. Two peer-context constructions are studied: positive peer imitation and contrastive success-failure conditioning. MOPD consistently improves over standard on-policy baselines on competitive programming, mathematical reasoning, scientific question answering, and tool-use benchmarks, and mixed success-failure contexts align teacher scores more closely with verifier rewards.
📝 Abstract
Large language models are often post-trained with sparse verifier rewards, which indicate whether a sampled trajectory succeeds but provide limited guidance about where reasoning succeeds or fails. On-policy distillation (OPD) offers denser token-level supervision by training on student-generated trajectories, yet existing methods typically distill each rollout independently and ignore the other attempts sampled for the same prompt. We introduce Multi-Rollout On-Policy Distillation (MOPD), a peer-conditioned distillation framework that uses the student's local rollout group to construct more informative teacher signals. MOPD conditions the teacher on both successful and failed peer rollouts: successes provide positive evidence for valid reasoning patterns, while failures provide structured negative evidence about plausible mistakes to avoid. We study two peer-context constructions: positive peer imitation and contrastive success-failure conditioning. Experiments on competitive programming, mathematical reasoning, scientific question answering, and tool-use benchmarks show that MOPD consistently improves over standard on-policy baselines. Further teacher-signal analysis shows that mixed success-failure contexts better align teacher scores with verifier rewards, indicating that the gains arise from more faithful, instance-adaptive supervision. These results indicate that effective on-policy distillation should exploit the student's multi-rollout trial-and-error behavior rather than treating rollouts as isolated samples.
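The two peer-context constructions named in the abstract can be illustrated with a minimal sketch. The abstract does not give the prompt template, the success threshold, or the number of peers used, so everything below (the `Rollout` type, `build_peer_context`, the section labels, `max_peers`, and excluding the target rollout from its own context) is a hypothetical assumption for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    text: str      # student-generated trajectory
    reward: float  # sparse verifier reward; assumed 1.0 = success, 0.0 = failure


def build_peer_context(prompt: str, group: List[Rollout], target_idx: int,
                       mode: str = "contrastive", max_peers: int = 2) -> str:
    """Assemble a teacher context from the other rollouts of the same prompt.

    mode="positive":    include successful peers only (positive peer imitation).
    mode="contrastive": include successful and failed peers.
    Hypothetical template; the paper's actual format is not stated in the abstract.
    """
    if mode not in ("positive", "contrastive"):
        raise ValueError(f"unknown mode: {mode}")

    # Assumption: the rollout being scored is left out of its own context.
    peers = [r for i, r in enumerate(group) if i != target_idx]
    successes = [r for r in peers if r.reward > 0.5][:max_peers]
    failures = [r for r in peers if r.reward <= 0.5][:max_peers]

    parts = [f"Problem:\n{prompt}"]
    for r in successes:
        parts.append(f"Successful attempt:\n{r.text}")
    if mode == "contrastive":
        for r in failures:
            parts.append(f"Failed attempt:\n{r.text}")
    return "\n\n".join(parts)
```

In this sketch the teacher would score the target rollout's tokens given the returned context, so the supervision for one attempt depends on what the student's other attempts at the same prompt got right or wrong.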