🤖 AI Summary
This work addresses a limitation of existing selective on-policy distillation (OPD) methods, which use KL divergence or entropy to select training tokens but cannot distinguish learnable teacher signals from non-learnable ones. The paper introduces "token teachability": the degree to which the teacher's disagreement places corrective probability mass on the student's current top-K candidates rather than off the student's support. A fixed-context diagnostic, which measures the reduction in teacher-student KL divergence under the same context, shows that teachability predicts improvement better than raw KL alone. Building on this, the authors propose TA-OPD, a lightweight method that applies the OPD loss only at high-teachability token positions and requires no reward model or verifier. Across Qwen2.5 and Qwen3 teacher-student settings, TA-OPD retaining only 5% of tokens often surpasses full-token OPD and improves over entropy- and divergence-based baselines.
📝 Abstract
On-policy distillation (OPD) trains a student on its own rollouts with token-level teacher supervision. Recent selective OPD methods exploit the non-uniformity of OPD signals by prioritizing high-entropy or high-disagreement tokens. We revisit this principle and ask: which token-level teacher signals are actually learnable? Using a fixed-context diagnostic that measures same-context teacher-student KL reduction, we show that raw KL disagreement is a coarse proxy for learning value. It conflates learnable disagreement, where the teacher assigns corrective mass to the student's top-K candidates, with incompatible disagreement, where the teacher places mass mostly off the student's current support. We formalize this local compatibility as token teachability and show that it better predicts fixed-context improvement than raw KL alone. Motivated by this finding, we propose Teachability-Aware OPD (TA-OPD), a lightweight token-position selection method that applies the OPD loss to high-teachability positions without reward models or verifiers. Across Qwen2.5 and Qwen3 teacher-student settings, TA-OPD often surpasses full-token OPD with only 5% retained tokens and improves over entropy- and divergence-based baselines. Our results reframe selective OPD as selecting learnable teacher signals rather than merely salient tokens.
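The distinction between learnable and incompatible disagreement can be illustrated with a small sketch. This is not the paper's scoring rule, which the abstract does not specify. It assumes a hypothetical score equal to KL(teacher || student) multiplied by the teacher's probability mass on the student's top-K tokens, and the function names, the KL direction, and the toy distributions are all illustrative assumptions.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def topk_support(probs, k):
    """Indices of the k highest-probability tokens."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

def teacher_mass_on_student_support(teacher, student, k):
    """Teacher probability mass that falls on the student's top-k candidates."""
    return sum(teacher[i] for i in topk_support(student, k))

def select_positions(teacher_dists, student_dists, k=2, keep_ratio=0.05):
    """Score each position and keep the top `keep_ratio` fraction (at least one).

    ASSUMPTION: score = KL(teacher || student) * on-support teacher mass.
    This is a stand-in for teachability, not the paper's definition.
    """
    scores = [
        kl_divergence(t, s) * teacher_mass_on_student_support(t, s, k)
        for t, s in zip(teacher_dists, student_dists)
    ]
    n_keep = max(1, math.ceil(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep]), scores

# Toy 4-token vocabulary; the student distribution is the same at every position.
student = [0.6, 0.3, 0.05, 0.05]
teacher_learnable = [0.2, 0.7, 0.05, 0.05]     # reweights the student's top-2 tokens
teacher_incompatible = [0.05, 0.05, 0.1, 0.8]  # mass mostly off the student's top-2
teacher_agreeing = list(student)               # no disagreement at all

teachers = [teacher_learnable, teacher_incompatible, teacher_agreeing]
students = [student, student, student]
kept, scores = select_positions(teachers, students, k=2, keep_ratio=0.05)
```

In this toy case the incompatible position has the larger raw KL, so a purely divergence-based selector would prefer it, while the support-weighted score ranks the learnable position first. How TA-OPD actually combines these quantities is left to the paper itself.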