🤖 AI Summary
This work investigates the role of the reward-extrapolation coefficient λ in on-policy distillation (OPD) for structured-output tasks. While λ > 1 can lift the student past the teacher in-domain, exceeding a critical threshold λ* induces format collapse. Working in a single-position Bernoulli reduction, the study derives a closed-form expression for the safe threshold λ*, determined by the teacher’s modal probability, the warm-start mass, and the importance-sampling clip strength, and thereby identifies the mechanism that separates format preservation from format collapse. The analysis is extended to calibrated K-ary listwise JSON generation. In three pre-registered tests on the Amazon Fashion dataset, the observed values fall within their locked prediction windows. Operating just below λ*, ListOPD brings a 1.7B-parameter Qwen3 student to in-domain parity with an 8B-SFT baseline at one-fifth the parameter count. Across λ, NDCG@1 on parsed outputs remains flat while parse validity changes sharply at the predicted boundary, consistent with the derived threshold.
📝 Abstract
On-policy distillation (OPD) is widely used for LLM post-training. When pushed with a reward-extrapolation coefficient lambda > 1, the student can lift past the teacher in-domain, but past a threshold lambda* the same step violates the output contract on structured-output tasks. In a single-position Bernoulli reduction, we derive a closed-form base-relative clip-safety threshold lambda*(p,b,c) determined by three measurable quantities: the teacher modal probability, the warm-start mass, and the importance-sampling clip strength. Above lambda*, the extrapolated fixed point exits the clip-safe region, changing training from format-preserving to format-collapsing. We extend the rule to calibrated K-ary listwise JSON tasks where a single binding equivalence class dominates the output contract and SFT retains parse headroom. On Amazon Fashion, three pre-registered tests (a fine-grid cliff interval, a budget-extension test, and a small-clip cross-prediction) fall within their locked prediction windows, with the small-clip value matching the closed-form prediction below grid resolution. Operating just below lambda*, ListOPD brings a 1.7B Qwen3 student to in-domain parity with an 8B-SFT baseline at one-fifth the parameters. The gain is driven primarily by format adherence: NDCG@1 on parsed outputs remains flat across lambda, while parse validity changes sharply at the predicted boundary. The cliff diagnostic is rubric-independent, whereas the parity claim uses a Gemini-graded rubric and inherits that evaluator's exposure.
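The two diagnostics contrasted above, parse validity and NDCG@1 on parsed outputs, can be illustrated with a minimal Python sketch. It is not the paper's evaluation code. The output contract (a non-empty JSON array of item-ID strings), the linear-gain form of NDCG@1, and the names `parse_ranking`, `ndcg_at_1`, and `evaluate` are all assumptions made for illustration.

```python
import json
from typing import Dict, List, Optional, Sequence, Tuple


def parse_ranking(raw: str) -> Optional[List[str]]:
    """Return the ranked list if `raw` satisfies the ASSUMED contract
    (a non-empty JSON array of strings); otherwise return None."""
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if isinstance(value, list) and value and all(isinstance(x, str) for x in value):
        return value
    return None


def ndcg_at_1(ranking: Sequence[str], relevance: Dict[str, float]) -> float:
    """NDCG@1 with linear gain (an assumption): relevance of the
    top-ranked item divided by the best achievable relevance."""
    best = max(relevance.values(), default=0.0)
    if best <= 0:
        return 0.0
    return relevance.get(ranking[0], 0.0) / best


def evaluate(outputs: Sequence[str],
             relevances: Sequence[Dict[str, float]]) -> Tuple[float, float]:
    """Return (parse validity over all outputs, mean NDCG@1 over parsed outputs)."""
    parsed = [(parse_ranking(o), r) for o, r in zip(outputs, relevances)]
    valid = [(p, r) for p, r in parsed if p is not None]
    parse_validity = len(valid) / len(outputs) if outputs else 0.0
    # NDCG@1 is averaged only over outputs that parsed, so it can stay
    # flat even while parse validity drops.
    ndcg = (sum(ndcg_at_1(p, r) for p, r in valid) / len(valid)
            if valid else float("nan"))
    return parse_validity, ndcg
```

Because the ranking metric is conditioned on successful parsing, a format collapse shows up in the first number and not necessarily in the second, which is the separation the abstract describes.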