Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

📅 2026-05-13

🤖 AI Summary
In strong-to-weak on-policy distillation, supervising the full trajectory is often inefficient because later segments suffer from “local teachability collapse”: teacher feedback is still available there but is no longer discriminative. This work formally defines the phenomenon and introduces a trajectory-adaptive supervision release mechanism that truncates dense supervision once the feedback stops being informative, keeping only the discriminative portion of the trajectory for training. The method combines NLTK sentence segmentation, a teacher margin computed over the student's top-K candidates, and Bayesian Information Criterion (BIC)-style change-point detection to locate the supervision boundary. Across multiple Qwen3 student models, the approach consistently outperforms full-trajectory distillation on five in-domain benchmarks and better preserves out-of-domain capabilities than baseline distillation methods.
📝 Abstract
On-policy distillation (OPD) trains a student model on its own rollouts using dense feedback from a stronger teacher. Prior literature suggests that, provided teacher feedback is available, supervising the full sequence of response tokens should monotonically improve performance. However, we demonstrate that this assumption sometimes fails to hold in strong-to-weak OPD settings. While later segments of a generated trajectory may still exhibit a non-zero teacher-student advantage, they frequently lack the local contrast that makes dense feedback effective for prioritizing student learning. We term this failure mode local teachability collapse. The resulting principle is straightforward: supervision should concentrate on trajectory regions where the teacher's feedback remains discriminative, rather than uniformly covering the entire response. We operationalize this principle through a trajectory-specific release rule. This rule measures the teacher's margin over the student's top-$K$ candidate set, aggregates this margin across NLTK-tokenized sentence segments, and truncates dense OPD supervision upon detecting a BIC-style downward change point. Experimental results across strong-to-weak distillation tasks using the Qwen3 model family indicate that this release rule consistently outperforms standard full-trajectory OPD across five in-domain benchmarks at various student scales. Furthermore, compared to baseline distillation methods, our approach better preserves model capabilities on out-of-domain tasks. These results suggest that effective strong-to-weak OPD requires evaluating not only the availability of teacher guidance but also its local utility, ensuring that the generated feedback remains teachable.
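The release rule described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation, and several details are assumptions made only for illustration: the margin is taken to be the gap between the best and second-best teacher log-probability among the student's top-K candidates, sentence scores are plain means, and the change point is found by comparing a single-mean Gaussian model against a two-segment mean-shift model with a BIC penalty. The function names (`token_margins`, `sentence_scores`, `release_index`) and the `min_seg` parameter are hypothetical. Sentence spans are passed in as token index pairs; in practice they might be derived from NLTK sentence segmentation, which the abstract mentions.

```python
import math

def token_margins(teacher_logps_on_topk):
    """Per-token margin (assumed definition): gap between the best and
    second-best teacher log-probabilities among the student's top-K
    candidates. Each inner list needs at least two candidates."""
    margins = []
    for cand in teacher_logps_on_topk:
        top = sorted(cand, reverse=True)
        margins.append(top[0] - top[1])
    return margins

def sentence_scores(margins, spans):
    """Mean margin per sentence; spans are (start, end) token indices,
    with end exclusive."""
    return [sum(margins[s:e]) / (e - s) for s, e in spans]

def _rss(xs):
    # Residual sum of squares around the segment mean.
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

def _bic(rss, n, n_params, eps=1e-12):
    # Gaussian BIC up to an additive constant; eps guards log(0).
    return n * math.log(max(rss, eps) / n) + n_params * math.log(n)

def release_index(scores, min_seg=2):
    """Number of leading sentences to keep under dense supervision.
    Returns len(scores) when no downward change point is supported."""
    n = len(scores)
    if n < 2 * min_seg:
        return n
    # Baseline: one mean and one variance, no change point.
    best_k, best_bic = n, _bic(_rss(scores), n, 2)
    for k in range(min_seg, n - min_seg + 1):
        left, right = scores[:k], scores[k:]
        if sum(right) / len(right) >= sum(left) / len(left):
            continue  # only a downward shift triggers release
        # Two means, one variance, one change-point location.
        bic = _bic(_rss(left) + _rss(right), n, 4)
        if bic < best_bic:
            best_k, best_bic = k, bic
    return best_k

if __name__ == "__main__":
    # Synthetic sentence scores: high margin early, collapsed margin late.
    scores = [2.0, 2.1, 1.9, 2.0, 0.1, 0.2, 0.1, 0.15]
    print(release_index(scores))  # prints 4
```

In this sketch, tokens belonging to sentences at or beyond the returned index would simply be masked out of the dense distillation loss. Treating the change-point location as an extra parameter makes the two-segment model slightly harder to accept, which biases the rule toward full-trajectory supervision when the evidence for a drop is weak.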
Problem

Research questions and friction points this paper is trying to address.

on-policy distillation
strong-to-weak
local teachability collapse
dense feedback
trajectory supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

on-policy distillation
local teachability collapse
trajectory-specific release rule
strong-to-weak distillation
dense feedback