🤖 AI Summary
This paper addresses exposure bias in diffusion models, which arises from the input-distribution mismatch between training (ground-truth noise interpolations) and inference (generated noisy samples). To mitigate this, we propose MixFlow, a novel training paradigm. Its core contribution is the identification and exploitation of the "Slow Flow" phenomenon: the ground-truth interpolation nearest to a generated sample corresponds to a higher-noise timestep than the sampling timestep. MixFlow introduces noise-schedule-aware mixing of interpolations at these higher-noise timesteps (the "slowed interpolation mixture"), aligning the training inputs with the "slowed timesteps" the prediction network actually encounters during sampling, and thereby alleviating exposure bias at the mechanistic level. MixFlow is applied as post-training fine-tuning of the prediction network. On ImageNet, RAE models trained with MixFlow achieve 1.43 FID (without guidance) and 1.10 FID (with guidance), significantly outperforming baselines at both 256×256 and 512×512 resolutions.
📝 Abstract
This paper studies the training-testing discrepancy (a.k.a. exposure bias) problem for improving diffusion models. During training, the input to the prediction network at a given timestep is the corresponding ground-truth noisy data, an interpolation of the noise and the data; during testing, the input is the generated noisy data. We present a novel training approach, named MixFlow, for improving the performance. Our approach is motivated by the Slow Flow phenomenon: the ground-truth interpolation nearest to the generated noisy data at a given sampling timestep is observed to correspond to a higher-noise timestep (termed the slowed timestep), i.e., the corresponding ground-truth timestep lags behind the sampling timestep. MixFlow leverages the interpolations at the slowed timesteps, named the slowed interpolation mixture, for post-training the prediction network at each training timestep. Experiments on class-conditional image generation (including SiT, REPA, and RAE) and text-to-image generation validate the effectiveness of our approach. Applied to RAE models, MixFlow achieves strong generation results on ImageNet: 1.43 FID (without guidance) and 1.10 FID (with guidance) at 256×256, and 1.55 FID (without guidance) and 1.10 FID (with guidance) at 512×512.
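The Slow Flow observation can be illustrated with a toy experiment. The sketch below assumes the common linear flow-matching interpolation x_t = (1 - t)·x0 + t·ε (the paper's exact noise schedule may differ), simulates a generated sample at a sampling timestep that has drifted slightly toward the noise, and searches a grid for the nearest ground-truth interpolation; the minimizing timestep is the "slowed timestep". All names here (`interp`, `slowed_timestep`) are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D "data" and noise samples (stand-ins for images/latents).
x0 = rng.standard_normal(64)   # data sample
eps = rng.standard_normal(64)  # Gaussian noise sample

def interp(t):
    """Ground-truth interpolation at timestep t in [0, 1] (t = 1 is pure noise).
    Assumes a linear flow-matching schedule, not necessarily the paper's."""
    return (1.0 - t) * x0 + t * eps

def slowed_timestep(x_gen, grid=np.linspace(0.0, 1.0, 1001)):
    """Timestep whose ground-truth interpolation is nearest to x_gen."""
    dists = [np.linalg.norm(interp(t) - x_gen) for t in grid]
    return grid[int(np.argmin(dists))]

# Simulate a generated sample at sampling timestep 0.5 whose accumulated
# sampling error leaves it slightly noisier than the ground-truth point.
t_sampling = 0.5
x_gen = interp(t_sampling) + 0.1 * eps

t_slowed = slowed_timestep(x_gen)
# The nearest ground-truth interpolation sits at a higher-noise timestep,
# i.e. t_slowed > t_sampling: the flow has "slowed" relative to sampling.
```

MixFlow's training mixture is then built from interpolations at such slowed timesteps rather than at the nominal sampling timestep, so the network is post-trained on inputs distributed like those it will actually receive.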