Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In strong-to-weak on-policy distillation, supervising the full trajectory is often inefficient because later segments suffer from “local teachability collapse”: teacher feedback is still available there but no longer discriminative. This work defines the phenomenon and introduces a trajectory-adaptive supervision release mechanism that truncates dense supervision once feedback stops being informative, keeping only the most discriminative portion of the teacher's feedback for training. The method combines NLTK sentence segmentation, the teacher's margin over the student's top-K candidate set, and Bayesian Information Criterion (BIC)-style change-point detection to locate the supervision boundary. Across multiple student models in the Qwen3 series, the approach consistently outperforms full-trajectory distillation on five in-domain benchmarks and better preserves capabilities on out-of-domain tasks.

📝 Abstract

On-policy distillation (OPD) trains a student model on its own rollouts using dense feedback from a stronger teacher. Prior literature suggests that, provided teacher feedback is available, supervising the full sequence of response tokens should monotonically improve performance. However, we demonstrate that this assumption does not always hold in strong-to-weak OPD settings. While later segments of a generated trajectory may still exhibit a non-zero teacher-student advantage, they frequently lack the local contrast that makes dense feedback effective for prioritizing student learning. We term this failure mode local teachability collapse. The resulting principle is straightforward: supervision should concentrate on trajectory regions where the teacher's feedback remains discriminative, rather than uniformly covering the entire response. We operationalize this principle through a trajectory-specific release rule. This rule measures the teacher's margin over the student's top-$K$ candidate set, aggregates this margin across NLTK-tokenized sentence segments, and truncates dense OPD supervision upon detecting a BIC-style downward change point. Experimental results across strong-to-weak distillation tasks using the Qwen3 model family indicate that this release rule consistently outperforms standard full-trajectory OPD across five in-domain benchmarks at various student scales. Furthermore, compared to baseline distillation methods, our approach better preserves model capabilities on out-of-domain tasks. These results suggest that effective strong-to-weak OPD requires evaluating not only the availability of teacher guidance but also its local utility, ensuring that the generated feedback remains teachable.
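
The release rule described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation, and several details are assumptions because the abstract does not specify them: the margin is taken here as the gap between the teacher's best and second-best log-probability within the student's top-K tokens, sentence segment ids are assumed to be precomputed (for example with NLTK's sentence tokenizer), and the change point uses a generic Gaussian BIC that compares a single-mean model against a two-mean model. All function names (`topk_margin`, `segment_means`, `release_point`) and the `min_len` parameter are hypothetical.

```python
import math


def topk_margin(teacher_logp, student_logp, k):
    """Assumed margin: teacher's best-minus-second-best log-prob,
    restricted to the student's top-k candidate tokens."""
    top = sorted(range(len(student_logp)),
                 key=lambda i: student_logp[i], reverse=True)[:k]
    t = sorted((teacher_logp[i] for i in top), reverse=True)
    return t[0] - t[1] if len(t) > 1 else 0.0


def segment_means(values, segment_ids):
    """Average per-token values within each sentence segment,
    in order of first appearance (dicts preserve insertion order)."""
    sums, counts = {}, {}
    for v, s in zip(values, segment_ids):
        sums[s] = sums.get(s, 0.0) + v
        counts[s] = counts.get(s, 0) + 1
    return [sums[s] / counts[s] for s in sums]


def _rss(xs):
    """Residual sum of squares around the mean of xs."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)


def _bic(rss, n, n_params, eps=1e-8):
    """Gaussian BIC up to an additive constant: n*ln(RSS/n) + p*ln(n)."""
    return n * math.log(max(rss, eps) / n) + n_params * math.log(n)


def release_point(seg_margins, min_len=2):
    """Return the index of the first segment to release from dense
    supervision, or None if no downward change point is supported."""
    n = len(seg_margins)
    if n < 2 * min_len:
        return None
    best_bic = _bic(_rss(seg_margins), n, 1)  # single-mean model
    best_split = None
    for s in range(min_len, n - min_len + 1):
        left, right = seg_margins[:s], seg_margins[s:]
        if sum(right) / len(right) >= sum(left) / len(left):
            continue  # only downward shifts count as a collapse
        # Two means plus one change location: 3 parameters (an assumption).
        bic = _bic(_rss(left) + _rss(right), n, 3)
        if bic < best_bic:
            best_bic, best_split = bic, s
    return best_split
```

In this sketch, a rollout whose per-sentence margins drop sharply partway through would have dense supervision kept for segments before the returned index and released from that index onward; when `release_point` returns `None`, the full trajectory stays supervised.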

Problem

Research questions and friction points this paper is trying to address.

on-policy distillation

strong-to-weak

local teachability collapse

dense feedback

trajectory supervision

Innovation

Methods, ideas, or system contributions that make the work stand out.

on-policy distillation

local teachability collapse

trajectory-specific release rule