Restoring the Sweet Spot: Pass-Rate Weighted Self-Distillation for LLM Reasoning

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a key limitation of Self-Distillation Policy Optimization (SDPO): its KL-based advantage has no notion of problem difficulty, so training does not concentrate on the intermediate-difficulty questions where the model learns most. The proposed SC-SDPO weights each question's SDPO loss by $[\hat{p}(1-\hat{p})]^{1/2}$, where the pass rate $\hat{p}$ comes at zero extra cost from on-policy rollouts, yielding a scale-consistent, difficulty-aware self-distillation objective. The weighting induces an implicit curriculum that tracks the model's evolving competence. Combining an analysis of GRPO's advantage normalization, the learnability framework, and batch-adaptive normalization, SC-SDPO improves over SDPO on scientific reasoning and tool-use benchmarks by +3.2/+4.3 points (mean@16/maj@16) on Qwen3-8B and +1.8/+3.0 points on OLMo-3-7B, while maintaining stable training dynamics.
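As a rough illustration of the weighting, the sketch below computes per-question weights $[\hat{p}(1-\hat{p})]^{1/2}$ from binary rollout outcomes. The function name `sc_sdpo_weights`, the 0/1 reward layout, and the choice to rescale weights to a batch mean of 1 are assumptions made for this example; the summary does not specify the exact form of the batch-adaptive normalization.

```python
import numpy as np


def sc_sdpo_weights(rewards, eps=1e-8):
    """Hypothetical helper: per-question weights from rollout outcomes.

    rewards: array-like of shape (num_questions, num_rollouts) holding
             0/1 correctness for each on-policy rollout.
    Returns one weight per question, proportional to sqrt(p_hat * (1 - p_hat)).
    """
    rewards = np.asarray(rewards, dtype=float)
    # Empirical pass rate per question, reusing rollouts that already exist.
    p_hat = rewards.mean(axis=1)
    raw = np.sqrt(p_hat * (1.0 - p_hat))
    # ASSUMPTION: "batch-adaptive normalization" is modeled here as
    # rescaling so the weights average to 1 over the batch.
    batch_mean = raw.mean()
    if batch_mean < eps:
        # Every question is all-pass or all-fail: no weight to distribute.
        return np.zeros_like(raw)
    return raw / batch_mean


# Four questions with pass rates 1.0, 0.5, 0.25, and 0.0.
example_rewards = [
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
example_weights = sc_sdpo_weights(example_rewards)
```

In this example the always-solved and never-solved questions receive zero weight, and the question with a pass rate of 0.5 receives the largest weight. As pass rates shift during training, the same question's weight changes with them, which is one way to read the implicit curriculum described above.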

📝 Abstract

Self-Distillation Policy Optimization (SDPO) provides dense token-level credit assignment for reinforcement learning with large language models by leveraging the model's own feedback-conditioned predictions as a self-teacher. However, unlike GRPO, whose group-relative advantage naturally concentrates learning on a sweet spot of intermediate-difficulty questions, SDPO's KL-based advantage has no implicit awareness of difficulty. We analyze this gap through the lens of GRPO's advantage normalization. Extending the learnability framework to normalized rewards, we show that normalization absorbs the variance term $p(1-p)$, where $p$ is a question's pass rate, equalizing leading-order learnability across questions and leaving $\sqrt{p(1-p)}$ as the sole residual scaling factor in the per-question gradient. This analysis yields a simple prescription: weight each question's SDPO loss by $[\hat{p}(1-\hat{p})]^{1/2}$, where $\hat{p}$ is the empirical pass rate, resulting in SC-SDPO, a scale-consistent variant of SDPO. The proposed weights are obtained as a zero-cost byproduct of on-policy rollouts with batch-adaptive normalization, inducing an implicit curriculum that dynamically tracks the model's evolving competence. Experiments on scientific reasoning and tool-use benchmarks demonstrate that SC-SDPO consistently improves over SDPO, yielding gains of +3.2/+4.3 (mean@16/maj@16) on Qwen3-8B and +1.8/+3.0 on OLMo-3-7B, while preserving stable training dynamics throughout optimization.
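The role of $\sqrt{p(1-p)}$ under advantage normalization can be checked numerically in a simplified setting. The sketch below assumes binary rewards and the exact (population) mean $p$ and standard deviation $\sqrt{p(1-p)}$ in place of group estimates; the function names are hypothetical. It illustrates a standard identity and is not the paper's own derivation, which may proceed differently.

```python
import math


def normalized_advantages(p):
    """Normalized advantages for a binary reward with pass rate p in (0, 1).

    Returns (advantage of a correct response, advantage of an incorrect one),
    computed as (reward - mean) / std with mean = p and std = sqrt(p(1-p)).
    """
    std = math.sqrt(p * (1.0 - p))
    return (1.0 - p) / std, (0.0 - p) / std


def advantage_moments(p):
    """Second moment and mean absolute value of the normalized advantage."""
    a_correct, a_wrong = normalized_advantages(p)
    second_moment = p * a_correct**2 + (1.0 - p) * a_wrong**2
    mean_abs = p * abs(a_correct) + (1.0 - p) * abs(a_wrong)
    return second_moment, mean_abs


for p in (0.1, 0.5, 0.9):
    second_moment, mean_abs = advantage_moments(p)
    # The second moment equals 1 at every p (the variance is normalized away),
    # while the mean absolute advantage equals 2 * sqrt(p * (1 - p)).
    print(p, round(second_moment, 6), round(mean_abs, 6),
          round(2 * math.sqrt(p * (1 - p)), 6))
```

Under these assumptions the normalized advantage has unit second moment at every pass rate, while its average magnitude still scales with $\sqrt{p(1-p)}$ and peaks at $p = 0.5$. This is consistent with the sweet spot of intermediate-difficulty questions that the abstract attributes to GRPO.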

Problem

Research questions and friction points this paper is trying to address.

self-distillation

reinforcement learning

large language models

difficulty awareness

reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

self-distillation

reinforcement learning

difficulty-aware weighting