The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs

📅 2026-05-09

🤖 AI Summary
This work examines the role of the reward-extrapolation coefficient λ in on-policy distillation (OPD) for structured-output tasks. Setting λ > 1 can lift the student past the teacher in-domain, but beyond a critical threshold λ* the same update causes format collapse. In a single-position Bernoulli reduction, the authors derive a closed-form clip-safety threshold λ* determined by three measurable quantities: the teacher’s modal probability, the warm-start mass, and the importance-sampling clip strength. Above λ*, the extrapolated fixed point leaves the clip-safe region and training shifts from format-preserving to format-collapsing. The rule is extended to calibrated K-ary listwise JSON tasks. On the Amazon Fashion dataset, three preregistered tests fall within their locked prediction windows, and, operating just below λ*, ListOPD brings a 1.7B-parameter Qwen3 student to in-domain parity with an 8B-SFT baseline at one-fifth the parameters. Across λ, NDCG@1 on parsed outputs stays flat while parse validity changes sharply at the predicted boundary, so the gain comes primarily from format adherence.
📝 Abstract
On-policy distillation (OPD) is widely used for LLM post-training. When pushed with a reward-extrapolation coefficient lambda > 1, the student can lift past the teacher in-domain, but past a threshold lambda* the same step violates the output contract on structured-output tasks. In a single-position Bernoulli reduction, we derive a closed-form base-relative clip-safety threshold lambda*(p,b,c) determined by three measurable quantities: the teacher modal probability, the warm-start mass, and the importance-sampling clip strength. Above lambda*, the extrapolated fixed point exits the clip-safe region, changing training from format-preserving to format-collapsing. We extend the rule to calibrated K-ary listwise JSON tasks where a single binding equivalence class dominates the output contract and SFT retains parse headroom. On Amazon Fashion, three pre-registered tests (a fine-grid cliff interval, a budget-extension test, and a small-clip cross-prediction) fall within their locked prediction windows, with the small-clip value matching the closed-form prediction below grid resolution. Operating just below lambda*, ListOPD brings a 1.7B Qwen3 student to in-domain parity with an 8B-SFT baseline at one-fifth the parameters. The gain is driven primarily by format adherence: NDCG@1 on parsed outputs remains flat across lambda, while parse validity changes sharply at the predicted boundary. The cliff diagnostic is rubric-independent, whereas the parity claim uses a Gemini-graded rubric and inherits that evaluator's exposure.
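
The abstract separates two measurements: parse validity over all outputs and NDCG@1 over only the outputs that parse. The sketch below illustrates that split and is not the paper's evaluation code. It assumes the output contract is a non-empty JSON array of item-ID strings and that each query has a single relevant item with binary relevance, in which case NDCG@1 reduces to top-1 accuracy. The names `parse_ranked_list` and `evaluate` are hypothetical.

```python
import json
import math
from typing import List, Optional, Sequence, Tuple


def parse_ranked_list(raw: str) -> Optional[List[str]]:
    """Return the ranked IDs if `raw` meets the assumed contract, else None.

    Assumed contract (not taken from the paper): a non-empty JSON array
    whose elements are all strings.
    """
    try:
        value = json.loads(raw)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    if not isinstance(value, list) or not value:
        return None
    if not all(isinstance(item, str) for item in value):
        return None
    return value


def evaluate(outputs: Sequence[str], gold_ids: Sequence[str]) -> Tuple[float, float]:
    """Return (parse_validity, ndcg_at_1_on_parsed).

    parse_validity is the fraction of all outputs that parse.
    NDCG@1 is averaged over parsed outputs only; with one relevant item and
    binary relevance it equals 1.0 when the top-ranked ID is the gold ID.
    """
    if not outputs or len(outputs) != len(gold_ids):
        raise ValueError("outputs and gold_ids must be non-empty and aligned")
    parsed = [parse_ranked_list(raw) for raw in outputs]
    valid = [(ranked, gold) for ranked, gold in zip(parsed, gold_ids) if ranked is not None]
    parse_validity = len(valid) / len(outputs)
    if not valid:
        return parse_validity, math.nan  # undefined when nothing parses
    ndcg_at_1 = sum(1.0 for ranked, gold in valid if ranked[0] == gold) / len(valid)
    return parse_validity, ndcg_at_1


if __name__ == "__main__":
    sample_outputs = ['["a", "b"]', '["b", "a"]', "not json", '{"a": 1}']
    sample_gold = ["a", "a", "a", "a"]
    print(evaluate(sample_outputs, sample_gold))
```

Because the ranking metric is conditioned on parsing, a model can lose parse validity while its NDCG@1 on the surviving outputs stays flat, which is the pattern the abstract reports at the predicted boundary.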
Problem

Research questions and friction points this paper is trying to address.

on-policy distillation
structured outputs
reward extrapolation
format collapse
output contract
Innovation

Methods, ideas, or system contributions that make the work stand out.

on-policy distillation
extrapolation cliff
clip-safety threshold
structured outputs
reward extrapolation