🤖 AI Summary
This work addresses the challenge of accurately approximating the gradient of a diffusion model's output distribution. We propose mean-shift distillation (MSD), a diffusion distillation technique that rigorously incorporates mean-shift theory into the distillation framework. MSD requires neither model retraining nor modification of the sampling procedure; instead, it applies mean-shift mode seeking directly to the output distribution, so that the extrema of the resulting gradient proxy align with the true data modes. To estimate this gradient efficiently while preserving mode alignment and convergence stability, we introduce a product-distribution sampling strategy. Used as a drop-in replacement for score distillation sampling (SDS) with Stable Diffusion, MSD improves mode alignment and convergence speed in text-to-image and text-to-3D generation, yielding higher-fidelity outputs.
📝 Abstract
We present mean-shift distillation, a novel diffusion distillation technique that provides a provably good proxy for the gradient of the diffusion output distribution. The proxy is derived directly from mean-shift mode seeking on the distribution, and we show that its extrema are aligned with the distribution's modes. We further derive an efficient product-distribution sampling procedure to evaluate the gradient. Our method is formulated as a drop-in replacement for score distillation sampling (SDS), requiring neither model retraining nor extensive modification of the sampling procedure. We show that it exhibits superior mode alignment as well as improved convergence in both synthetic and practical setups, yielding higher-fidelity results when applied to both text-to-image and text-to-3D applications with Stable Diffusion.
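To make the core idea concrete, below is a minimal NumPy sketch of classical mean-shift mode seeking, the procedure the abstract builds on: the mean-shift vector computed from kernel-weighted samples points toward a mode of the underlying density, so iterating it converges to that mode. This is an illustrative toy on a 2D Gaussian mixture, not the paper's distillation implementation; the bandwidth, sample counts, and function names here are assumptions for the example.

```python
import numpy as np

def mean_shift_step(x, samples, bandwidth):
    """One mean-shift update: Gaussian-kernel-weighted mean of samples, minus x.

    The returned vector is proportional to the gradient of the
    kernel density estimate's log-density at x, so following it
    performs gradient ascent toward a mode.
    """
    d = samples - x
    w = np.exp(-np.sum(d**2, axis=1) / (2.0 * bandwidth**2))
    weighted_mean = (w[:, None] * samples).sum(axis=0) / w.sum()
    return weighted_mean - x

# Toy density: a two-mode Gaussian mixture in 2D.
rng = np.random.default_rng(0)
samples = np.concatenate([
    rng.normal([-2.0, 0.0], 0.3, size=(200, 2)),
    rng.normal([ 2.0, 0.0], 0.3, size=(200, 2)),
])

# Iterating the mean-shift vector from a nearby start
# converges to the mode around (2, 0).
x = np.array([1.0, 0.5])
for _ in range(50):
    x = x + mean_shift_step(x, samples, bandwidth=0.5)
```

The paper's contribution is to turn this mode-seeking vector into a distillation gradient on the diffusion output distribution, evaluated via product-distribution sampling rather than the explicit sample sum used in this toy.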