KL for a KL: On-Policy Distillation with Control Variate Baseline

📅 2026-05-08

🤖 AI Summary
This work addresses the instability of On-Policy Distillation (OPD), which stems from high-variance single-sample Monte Carlo gradient estimates and the lack of mature stabilization recipes. The authors cast OPD as a policy-gradient reinforcement learning problem and propose vOPD, which subtracts a closed-form value function baseline to reduce gradient variance while keeping the gradient unbiased. The baseline is the per-token negative reverse KL divergence between student and teacher; it requires no additional critic network or inference and admits an efficient top-k approximation. On mathematical and scientific reasoning benchmarks, vOPD consistently outperforms vanilla OPD and matches the most expensive full-vocabulary baseline.
📝 Abstract
On-Policy Distillation (OPD) has emerged as a dominant post-training paradigm for large language models, especially for reasoning domains. However, OPD remains unstable in practice due to the high gradient variance of its single-sample Monte Carlo estimator, and recipes for stable training are still immature. We propose vOPD (On-Policy Distillation with a control variate baseline), which casts OPD as policy-gradient RL and stabilizes it by introducing a control variate baseline (canonically a value function) from the RL literature. We show that the OPD value function admits a closed form as the per-token negative reverse KL divergence between the student and the teacher, available directly from the already-computed forward pass with no additional critic or inference. Existing stabilization methods either compute the full token-level reverse KL over the entire vocabulary, adding significant overhead, or restrict it to a top-k support, biasing the objective. vOPD instead preserves the lightweight single-sample estimator, subtracting the value function as a detached baseline to keep the gradient unbiased while reducing variance. Furthermore, we show that a top-k approximation of the baseline further lowers cost without compromising performance. Across mathematical and scientific reasoning benchmarks, vOPD consistently outperforms vanilla OPD and matches the most expensive full-vocabulary baseline, offering an efficient stabilization of On-Policy Distillation through principled RL variance reduction.
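To make the baseline idea concrete, here is a minimal single-token sketch in NumPy. It is not the authors' implementation. It assumes the per-token reward for a sampled token y is log q(y) - log p(y), with p the student and q the teacher distribution, so that the expected reward under the student is the negative reverse KL. The function names (`opd_token_terms`, `expected_gradient`, `topk_baseline`) are hypothetical, and the top-k variant shown (summing over the student's k most probable tokens without renormalizing) is only one possible reading of the approximation mentioned in the abstract.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over a 1-D logit vector.
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())

def opd_token_terms(student_logits, teacher_logits):
    # Assumed per-token reward: log q(y) - log p(y) for every candidate token y.
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    p = np.exp(log_p)
    reward = log_q - log_p
    # Expected reward under the student = -KL(student || teacher).
    baseline = float(np.sum(p * reward))
    return p, reward, baseline

def expected_gradient(p, reward, baseline):
    # Exact expectation over y ~ p of (reward[y] - baseline) * grad log p(y),
    # where the gradient w.r.t. the student logits is (one_hot(y) - p).
    score = np.eye(len(p)) - p          # row y holds one_hot(y) - p
    weights = p * (reward - baseline)   # probability times advantage
    return (weights[:, None] * score).sum(axis=0)

def topk_baseline(p, reward, k):
    # Assumed sparse approximation: restrict the sum to the student's
    # k most probable tokens, without renormalizing.
    idx = np.argsort(p)[-k:]
    return float(np.sum(p[idx] * reward[idx]))

rng = np.random.default_rng(0)
student = rng.normal(size=8)
teacher = rng.normal(size=8)
p, reward, baseline = opd_token_terms(student, teacher)

g_plain = expected_gradient(p, reward, 0.0)
g_base = expected_gradient(p, reward, baseline)
g_topk = expected_gradient(p, reward, topk_baseline(p, reward, 3))
print(np.allclose(g_plain, g_base), np.allclose(g_plain, g_topk))
```

Because the baseline does not depend on the sampled token, subtracting it leaves the expected gradient unchanged, and the same holds for the top-k variant in this sketch. In a training loop the baseline would be treated as a constant (detached from the computation graph), and only the sampled token's advantage, reward minus baseline, would scale its log-probability gradient.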
Problem

Research questions and friction points this paper is trying to address.

On-Policy Distillation
gradient variance
training instability
Monte Carlo estimator
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

On-Policy Distillation
control variate
variance reduction
reverse KL divergence
policy gradient