MixFlow Training: Alleviating Exposure Bias with Slowed Interpolation Mixture

📅 2025-12-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses exposure bias in diffusion models, which arises from a mismatch between input distributions at training time (ground-truth noise interpolations) and inference time (generated noisy data). To mitigate this, the authors propose MixFlow, a novel training paradigm. Its core contribution is identifying and exploiting the "Slow Flow" phenomenon: the ground-truth interpolation nearest to the generated noisy data at a sampling timestep corresponds to a higher-noise ("slowed") timestep. MixFlow therefore trains on a noise-schedule-aware mixture of interpolations at these slowed timesteps (the "slowed interpolation mixture"), aligning the training inputs with what the prediction network actually sees during sampling and thereby alleviating exposure bias at the mechanistic level. MixFlow is applied as a post-training stage on an existing prediction network. On ImageNet, the RAE model trained with MixFlow achieves 1.43 FID (without guidance) and 1.10 FID (with guidance), outperforming baselines at both 256×256 and 512×512 resolutions.

📝 Abstract
This paper studies the training-testing discrepancy (a.k.a. exposure bias) problem in diffusion models. During training, the input to the prediction network at a given timestep is the ground-truth noisy data, an interpolation of the noise and the data; during testing, the input is the generated noisy data. We present a novel training approach, named MixFlow, for improving performance. Our approach is motivated by the Slow Flow phenomenon: the ground-truth interpolation nearest to the generated noisy data at a given sampling timestep is observed to correspond to a higher-noise timestep (termed the slowed timestep), i.e., the corresponding ground-truth timestep is slower than the sampling timestep. MixFlow leverages the interpolations at the slowed timesteps, named the slowed interpolation mixture, to post-train the prediction network at each training timestep. Experiments on class-conditional image generation (including SiT, REPA, and RAE) and text-to-image generation validate the effectiveness of our approach. Applied to RAE models, MixFlow achieves strong generation results on ImageNet: 1.43 FID (without guidance) and 1.10 (with guidance) at 256×256, and 1.55 FID (without guidance) and 1.10 (with guidance) at 512×512.
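The abstract's mechanism can be sketched as a single post-training step. This is a minimal illustration, not the paper's implementation: it assumes a rectified-flow interpolation x_t = (1 - t)·x0 + t·eps with velocity target eps - x0, and the slowed-timestep sampler `slowed_timestep` (with its `delta_max` offset) is a hypothetical stand-in for the paper's noise-schedule-aware slowed interpolation mixture. The key idea shown is that the network is fed the interpolation at the slowed timestep t' ≥ t while still being conditioned on the sampling timestep t.

```python
import numpy as np

rng = np.random.default_rng(0)

def slowed_timestep(t, delta_max=0.1):
    """Sample a slowed (higher-noise) timestep t' >= t.
    The uniform offset is a HYPOTHETICAL choice; the paper's mixture
    defines the actual slowed-timestep distribution."""
    return np.minimum(t + rng.uniform(0.0, delta_max, size=t.shape), 1.0)

def mixflow_loss(predict_velocity, x0, eps, t):
    """Flow-matching loss where the network is conditioned on the sampling
    timestep t but fed the ground-truth interpolation at the slowed t'."""
    t_slow = slowed_timestep(t)
    # slowed interpolation under the assumed convention x_t = (1-t) x0 + t eps
    x_slow = (1.0 - t_slow)[:, None] * x0 + t_slow[:, None] * eps
    v_target = eps - x0                   # rectified-flow velocity target
    v_pred = predict_velocity(x_slow, t)  # conditioned on t, not t'
    return np.mean((v_pred - v_target) ** 2)

# Toy linear "network" just to make the sketch executable end to end.
def toy_predictor(x, t):
    return 0.5 * x

batch, dim = 8, 4
x0 = rng.standard_normal((batch, dim))
eps = rng.standard_normal((batch, dim))
t = rng.uniform(0.0, 0.9, size=batch)
loss = mixflow_loss(toy_predictor, x0, eps, t)
print(float(loss))
```

In standard flow-matching post-training, the network would instead be fed the interpolation at t itself; the only change here is substituting the slowed interpolation as input.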
Problem

Research questions and friction points this paper is trying to address.

Exposure bias: the prediction network is trained on ground-truth noise-data interpolations but receives generated noisy data at inference
The ground-truth interpolation nearest to generated noisy data corresponds to a higher-noise ("slowed") timestep, a mismatch standard training ignores
How to align training inputs with the inputs actually encountered during sampling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifies the Slow Flow phenomenon: generated samples at a sampling timestep lie closest to ground-truth interpolations at a slower, higher-noise timestep
Post-trains the prediction network on a "slowed interpolation mixture" built from interpolations at these slowed timesteps
Validated on class-conditional generation (SiT, REPA, RAE) and text-to-image generation
Authors
Hui Li, Fudan University
Jiayue Lyu, Fudan University
Fu-Yun Wang, Ph.D. candidate, Chinese University of Hong Kong
Kaihui Cheng, Fudan University
Siyu Zhu
Jingdong Wang, Baidu