🤖 AI Summary
This work proposes Adversarial Flow Distillation (AFD), a framework for distilling the video generation capabilities of a black-box teacher, which may be non-causal and architecturally different from the student, into a causal autoregressive student model using only the teacher's output videos. With this interface, supervised fine-tuning is off-policy and direct adversarial imitation gives rewards that are too sparse. AFD addresses both problems by querying the teacher and rolling out the current student on the same prompts to form sample pairs. A Bradley–Terry discriminator estimates the teacher–student discrepancy on clean samples, and the resulting advantage signal is converted into flow-matching supervision on the student's own noised states in the forward diffusion process, which yields dense velocity-field learning. According to the summary, this is the first approach to achieve end-to-end on-policy distillation without requiring teacher scores, latents, denoising trajectories, or step-wise alignment, and it introduces a forward-process credit assignment mechanism. Experiments on two causal autoregressive student families show that AFD consistently improves generation quality in motion- and physics-sensitive scenarios while preserving general video quality.
📝 Abstract
Autoregressive video generators are attractive for streaming, long-horizon, and interactive applications, but distilling strong black-box teachers into causal students remains difficult. The student must learn under its own rollout distribution, whereas practical teachers may expose only prompt-conditioned completed videos and may differ in architecture, capacity, temporal design, and sampling schedule. This interface makes supervised fine-tuning off-policy, score-based distillation inapplicable, and direct adversarial imitation too sparse for denoising-time credit assignment. We propose Adversarial Flow Distillation (AFD), an on-policy framework for heterogeneous black-box video distillation. AFD queries the teacher and rolls out the current student on the same prompts, trains a prompt-paired Bradley-Terry discriminator to estimate clean-sample teacher-student discrepancy, and converts the resulting on-policy advantage into forward-process flow-matching updates on the student's own noised states. Thus, AFD provides dense velocity-field supervision while requiring no teacher scores, latents, denoising trajectories, step alignment, or reverse-chain reinforcement learning. Experiments across two causal AR student families show that AFD consistently improves motion- and physics-sensitive generation while preserving general video quality, and ablations validate the importance of adaptive on-policy feedback and forward-process credit assignment. The method requires only clean teacher videos and student rollouts, providing a practical route for distilling proprietary or heterogeneous video generators into efficient autoregressive students.
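The two ingredients named in the abstract, a prompt-paired Bradley-Terry discriminator and flow-matching targets on forward-noised student samples, can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the function names are hypothetical, the discriminator is reduced to precomputed scalar scores, the noising uses a linear interpolation path x_t = (1 - t) x_0 + t ε with target velocity ε - x_0, and the per-sample weights stand in for whatever advantage-to-update conversion the paper actually defines, which the abstract does not specify.

```python
import numpy as np

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)) = -log(1 + exp(-z)).
    return -np.logaddexp(0.0, -z)

def bt_discriminator_loss(score_teacher, score_student):
    # Bradley-Terry pairwise loss on prompt-paired scores:
    # P(teacher preferred) = sigmoid(score_teacher - score_student).
    return float(-np.mean(log_sigmoid(score_teacher - score_student)))

def noised_state_and_target(x0, eps, t):
    # Assumed linear forward path: x_t = (1 - t) * x0 + t * eps,
    # whose velocity along the path is eps - x0.
    t = t.reshape(-1, *([1] * (x0.ndim - 1)))  # broadcast t over sample dims
    x_t = (1.0 - t) * x0 + t * eps
    v_target = eps - x0
    return x_t, v_target

def weighted_fm_loss(v_pred, v_target, weights):
    # Per-sample squared velocity error, scaled by a hypothetical
    # per-sample weight derived from the discriminator's advantage.
    axes = tuple(range(1, v_pred.ndim))
    per_sample = np.mean((v_pred - v_target) ** 2, axis=axes)
    return float(np.mean(weights * per_sample))

# Toy usage: 4 student rollouts flattened to 8-dim vectors.
rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 8))            # clean student samples
eps = rng.normal(size=(4, 8))           # Gaussian noise
t = rng.uniform(size=4)                 # one noise level per sample
x_t, v_target = noised_state_and_target(x0, eps, t)
d_loss = bt_discriminator_loss(rng.normal(size=4), rng.normal(size=4))
fm_loss = weighted_fm_loss(np.zeros_like(v_target), v_target, np.ones(4))
```

The point of the sketch is the data flow: the discriminator only ever sees clean teacher and student videos, while the student's velocity supervision is computed at every noise level of its own samples, so no teacher latents or denoising trajectories are needed.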