🤖 AI Summary
Flow Matching (FM) text-to-image models achieve strong generation quality but sample slowly because they rely on multi-step iterative solvers. To address this, the authors extend Score identity Distillation (SiD) to DiT-based FM models, the first direct adaptation of score distillation to this setting. Leveraging a Bayesian derivation, they establish a unified view of score-field modeling across diffusion and flow matching, enabling distillation without teacher fine-tuning or architectural modifications, and supporting both data-free and data-aided distillation regimes. Evaluated on state-of-the-art models, including SANA, SD3, and FLUX.1-dev, SiD achieves high-fidelity, text-aligned image generation in one or a few steps after lightweight adaptation. This yields substantial inference speedups while preserving visual quality and semantic alignment. SiD thus provides a unified, efficient acceleration framework bridging the diffusion and flow-matching generative paradigms.
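The "unified perspective on score-field modeling" can be made concrete with a short derivation. The sketch below assumes one common rectified-flow convention, x_t = (1-t) x_0 + t ε; the paper's exact parameterization may differ, but the argument is the same Bayes/conditional-expectation reasoning the summary refers to:

```latex
\[
x_t = (1-t)\,x_0 + t\,\epsilon,\qquad \epsilon \sim \mathcal{N}(0, I),
\qquad v^\star(x_t) = \mathbb{E}[\epsilon - x_0 \mid x_t].
\]
Taking conditional expectations of the interpolation given $x_t$,
\[
(1-t)\,\mathbb{E}[x_0 \mid x_t] + t\,\mathbb{E}[\epsilon \mid x_t] = x_t,
\]
and solving this together with the definition of $v^\star$ gives
\[
\mathbb{E}[\epsilon \mid x_t] = x_t + (1-t)\,v^\star(x_t), \qquad
\mathbb{E}[x_0 \mid x_t] = x_t - t\,v^\star(x_t).
\]
Since $p(x_t \mid x_0) = \mathcal{N}\big((1-t)x_0,\; t^2 I\big)$, Tweedie's formula yields the score
\[
\nabla_{x_t} \log p_t(x_t) = -\frac{\mathbb{E}[\epsilon \mid x_t]}{t},
\]
so a velocity predictor is an affine reparameterization of a score (or noise) predictor, which is what lets score distillation be applied to FM models unchanged.
```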
📝 Abstract
Diffusion models achieve high-quality image generation but are limited by slow iterative sampling. Distillation methods alleviate this by enabling one- or few-step generation. Flow matching, originally introduced as a distinct framework, has since been shown to be theoretically equivalent to diffusion under Gaussian assumptions, raising the question of whether distillation techniques such as score distillation transfer directly. We provide a simple derivation -- based on Bayes' rule and conditional expectations -- that unifies Gaussian diffusion and flow matching without relying on ODE/SDE formulations. Building on this view, we extend Score identity Distillation (SiD) to pretrained text-to-image flow-matching models, including SANA, SD3-Medium, SD3.5-Medium/Large, and FLUX.1-dev, all with DiT backbones. Experiments show that, with only modest flow-matching- and DiT-specific adjustments, SiD works out of the box across these models, in both data-free and data-aided settings, without requiring teacher fine-tuning or architectural changes. This provides the first systematic evidence that score distillation applies broadly to text-to-image flow-matching models, resolving prior concerns about stability and soundness and unifying acceleration techniques across diffusion- and flow-based generators. We will make the PyTorch implementation publicly available.
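The Bayes'-rule/conditional-expectation view in the abstract boils down to a pair of linear identities that convert a flow-matching velocity prediction into the noise and clean-data predictions that score distillation operates on. The following is a minimal numerical sketch (our notation, not the paper's code), assuming the interpolation x_t = (1-t)·x_0 + t·ε with velocity target v = ε - x_0:

```python
def velocity_to_eps_x0(x_t, v, t):
    """Convert a velocity prediction into noise and clean-data predictions.

    Solves the linear system implied by the interpolation:
        (1 - t) * x0_hat + t * eps_hat = x_t
        eps_hat - x0_hat               = v
    """
    eps_hat = x_t + (1.0 - t) * v
    x0_hat = x_t - t * v
    return eps_hat, x0_hat


def score_from_velocity(x_t, v, t):
    """Score of the Gaussian marginal at time t: grad log p_t(x_t) = -eps_hat / t."""
    eps_hat, _ = velocity_to_eps_x0(x_t, v, t)
    return -eps_hat / t


# Consistency check with exact targets: if the velocity is the true v = eps - x0,
# the conversion must recover the original eps and x0.
x0, eps, t = 2.0, -0.5, 0.3
x_t = (1.0 - t) * x0 + t * eps
eps_hat, x0_hat = velocity_to_eps_x0(x_t, eps - x0, t)
print(abs(eps_hat - eps) < 1e-9, abs(x0_hat - x0) < 1e-9)  # → True True
```

In practice these conversions are applied to model outputs rather than scalars, which is why SiD can reuse a pretrained FM teacher as a score model without fine-tuning it.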