🤖 AI Summary
This work addresses the distillation of large-scale diffusion models for industrial text-to-image (T2I) generation, where MeanFlow suffers from an unstable optimization objective and a “mean-seeking bias” that limit its scalability. The authors propose two remedies. First, a warm-up phase replaces MeanFlow’s differential solution with a discrete one, then switches back to the differential objective once the model can roughly fit the average velocity field, which stabilizes training. Second, trajectory distribution alignment is added as an auxiliary objective to mitigate the mean-seeking bias under few-step inference. The resulting framework outperforms existing distillation methods on FLUX.1-dev (12B) and continues to perform strongly on HunyuanImage 3.0 (80B), indicating that it generalizes to models with tens of billions of parameters and suits real-world T2I applications.
📝 Abstract
Diffusion models exhibit remarkable generative capability, but their high latency limits practical deployment. Many studies have attempted to accelerate inference by reducing the number of sampling steps. Among them, MeanFlow has attracted considerable attention for its concise formulation and strong performance. Nevertheless, the instability of its optimization objective and its "mean-seeking bias" have limited its applicability to distilling large-scale industrial models. To stabilize MeanFlow for distilling large-scale models, we first introduce a warm-up technique in which the original differential solution of MeanFlow is replaced by a discrete solution. This design avoids the training collapse that occurs when the MeanFlow target contains a stop-gradient term computed by an undertrained model. Once the model acquires a preliminary ability to fit the average velocity field, we switch the optimization objective back to the differential solution, enabling further refinement. Meanwhile, to alleviate the "mean-seeking bias" of MeanFlow under extremely few-step inference with complex target distributions, we incorporate trajectory distribution alignment as an auxiliary objective, encouraging the student model's trajectory distribution to align more closely with that of the teacher model. Our proposed distillation framework outperforms existing distillation approaches when applied to the text-to-image (T2I) model FLUX.1-dev (12B parameters). Furthermore, when extended to the 80B-parameter state-of-the-art (SOTA) T2I model HunyuanImage 3.0, our method continues to demonstrate robust generalization and strong performance.
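
The contrast between the two warm-up objectives can be illustrated with a toy sketch. The abstract does not define the discrete solution, so the code below assumes one plausible reading: the average velocity over an interval [r, t] is estimated by numerically integrating a teacher velocity field, whereas the differential target uses the MeanFlow identity u = v - (t - r) du/dt, which depends on the student's own derivative. The 1D field `teacher_velocity`, the Euler integrator, and the finite-difference derivative are all illustrative assumptions, not the authors' implementation.

```python
import math

def teacher_velocity(z, t):
    # Hypothetical toy teacher field dz/dt = z (not from the paper);
    # chosen because its trajectories have a closed form.
    return z

def exact_average_velocity(z_t, r, t):
    # Ground truth for the toy field: z_r = z_t * exp(r - t), and the
    # average velocity is the displacement divided by (t - r).
    return z_t * (1.0 - math.exp(r - t)) / (t - r)

def discrete_target(z_t, r, t, num_steps=1000):
    # Assumed "discrete solution": Euler-integrate the teacher from t back
    # to r, then read off the average velocity. Uses the teacher only.
    h = (t - r) / num_steps
    z, s = z_t, t
    for _ in range(num_steps):
        z -= h * teacher_velocity(z, s)
        s -= h
    return (z_t - z) / (t - r)

def differential_target(u, z_t, r, t, eps=1e-5):
    # MeanFlow identity: u = v - (t - r) * du/dt, with du/dt the total
    # derivative along the trajectory. Here it is approximated by a central
    # finite difference of the student's current estimate u.
    v = teacher_velocity(z_t, t)
    du_dt = (u(z_t + eps * v, r, t + eps) - u(z_t - eps * v, r, t - eps)) / (2 * eps)
    return v - (t - r) * du_dt

z_t, r, t = 1.0, 0.2, 0.8
exact = exact_average_velocity(z_t, r, t)
disc = discrete_target(z_t, r, t)
# A well-trained student (here: the exact solution) reproduces the true target.
diff_good = differential_target(exact_average_velocity, z_t, r, t)
# An untrained student (here: constant zero) yields a misleading target.
diff_bad = differential_target(lambda z, r, t: 0.0, z_t, r, t)
print(exact, disc, diff_good, diff_bad)
```

In this toy setting the discrete target stays close to the true average velocity regardless of the student, while the differential target is accurate only when the student's estimate is already reasonable. That is consistent with the motivation given above for warming up on the discrete objective before switching back, though the actual objectives used in the paper may differ.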