SODA: Semi On-Policy Black-Box Distillation for Large Language Models

📅 2026-04-04
🤖 AI Summary
This work addresses the limitations of existing black-box knowledge distillation approaches, where off-policy methods struggle to correct inherent student errors and fully on-policy strategies suffer from training instability and high computational costs. To overcome these challenges, the paper introduces SODA, the first semi-on-policy distillation framework, which constructs effective contrastive signals by pairing the teacher model’s optimal responses with static snapshots of the student’s outputs. This approach achieves high-quality distribution alignment using only static, suboptimal student behaviors—eliminating the need for dynamic inference or adversarial training. Empirical results demonstrate that SODA matches or surpasses state-of-the-art methods on 15 out of 16 benchmarks, accelerates training by up to 10×, reduces peak GPU memory usage by 27%, and completely removes adversarial instability.
📝 Abstract
Black-box knowledge distillation for large language models presents a strict trade-off. Simple off-policy methods (e.g., sequence-level knowledge distillation) struggle to correct the student's inherent errors, while fully on-policy methods (e.g., Generative Adversarial Distillation) solve this via adversarial training but introduce well-known training instability and crippling computational overhead. To address this dilemma, we propose SODA (Semi On-policy Distillation with Alignment), a highly efficient alternative motivated by the inherent capability gap between frontier teachers and much smaller base models. Because a compact student model's natural, zero-shot responses are almost strictly inferior to the powerful teacher's targets, we can construct a highly effective contrastive signal simply by pairing the teacher's optimal response with a one-time static snapshot of the student's outputs. This demonstrates that exposing the small student to its own static, inferior behaviors is sufficient for high-quality distribution alignment, eliminating the need for costly dynamic rollouts and fragile adversarial balancing. Extensive evaluations across four compact Qwen2.5 and Llama-3 models validate this semi on-policy paradigm: SODA matches or outperforms state-of-the-art methods on 15 of 16 benchmarks. More importantly, it achieves this distillation quality while training 10× faster, consuming 27% less peak GPU memory, and completely eliminating adversarial instability.
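The pairing scheme described in the abstract — teacher response as the preferred side, a one-time static snapshot of the student's own output as the dispreferred side — can be sketched as a DPO-style pairwise objective. The function name, the log-sigmoid form, and the `beta` scale below are illustrative assumptions, not the paper's exact loss:

```python
import math

def semi_onpolicy_pair_loss(logp_teacher_resp, logp_snapshot_resp, beta=0.1):
    # Hypothetical DPO-style contrastive objective (an assumption, not
    # necessarily SODA's exact formulation).
    # logp_teacher_resp: student's log-probability of the teacher's response
    #   (the "chosen" side of the contrastive pair).
    # logp_snapshot_resp: student's log-probability of its own response from
    #   a one-time static snapshot (the "rejected" side). The snapshot is
    #   generated once before training, so no per-step rollouts or
    #   adversarial discriminator updates are needed.
    margin = beta * (logp_teacher_resp - logp_snapshot_resp)
    # -log(sigmoid(margin)): small when the student assigns higher
    # probability to the teacher's response than to its old snapshot
    # behavior, large otherwise.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The "semi on-policy" character lies in where `logp_snapshot_resp` comes from: the rejected responses are the student's own generations, but collected in a single offline pass rather than regenerated every training step as a fully on-policy method would require.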
Problem

Research questions and friction points this paper addresses.

black-box knowledge distillation
on-policy distillation
training instability
computational overhead
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

semi on-policy distillation
black-box knowledge distillation
contrastive alignment
large language model compression
efficient distillation