🤖 AI Summary
Existing diffusion models typically employ pointwise reconstruction losses that are insensitive to signal spectra and multiscale structures, often yielding samples with imbalanced frequency content and insufficient structural detail. To address this limitation, this work proposes a lightweight, plug-and-play spectral regularization framework that introduces differentiable Fourier- and wavelet-domain losses as soft inductive biases during standard training. The approach requires no modifications to the diffusion process, model architecture, or sampling procedure, and is compatible with mainstream paradigms such as DDPM, DDIM, and EDM. Evaluated on high-resolution unconditional generation tasks for both images and audio, the method consistently enhances sample quality—particularly in terms of spectral fidelity and multiscale coherence—while incurring negligible computational overhead.
📝 Abstract
Diffusion models are typically trained using pointwise reconstruction objectives that are agnostic to the spectral and multi-scale structure of natural signals. We propose a loss-level spectral regularization framework that augments standard diffusion training with differentiable Fourier- and wavelet-domain losses, without modifying the diffusion process, model architecture, or sampling procedure. The proposed regularizers act as soft inductive biases that encourage appropriate frequency balance and coherent multi-scale structure in generated samples. Our approach is compatible with DDPM, DDIM, and EDM formulations and introduces negligible computational overhead. Experiments on image and audio generation demonstrate consistent improvements in sample quality, with the largest gains observed on higher-resolution, unconditional datasets where fine-scale structure is most challenging to model.
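The paper does not specify the exact form of its regularizers, but a loss of the kind the abstract describes — a pointwise reconstruction term augmented with differentiable Fourier-magnitude and Haar-wavelet penalties — might be sketched as follows. The log-magnitude spectrum comparison, the Haar decomposition depth, and the weights `lam_f`/`lam_w` are all illustrative assumptions, not the authors' choices:

```python
import numpy as np

def fourier_loss(x, y):
    # Illustrative spectral term: L1 distance between log-magnitude
    # spectra, which balances frequency bands more evenly than raw MSE.
    fx, fy = np.fft.fft2(x), np.fft.fft2(y)
    return np.mean(np.abs(np.log1p(np.abs(fx)) - np.log1p(np.abs(fy))))

def haar_decompose(x):
    # One level of an orthonormal 2-D Haar transform: returns the
    # low-pass band and the three detail bands (LH, HL, HH).
    a = (x[0::2, :] + x[1::2, :]) / np.sqrt(2.0)  # row low-pass
    d = (x[0::2, :] - x[1::2, :]) / np.sqrt(2.0)  # row high-pass
    ll = (a[:, 0::2] + a[:, 1::2]) / np.sqrt(2.0)
    lh = (a[:, 0::2] - a[:, 1::2]) / np.sqrt(2.0)
    hl = (d[:, 0::2] + d[:, 1::2]) / np.sqrt(2.0)
    hh = (d[:, 0::2] - d[:, 1::2]) / np.sqrt(2.0)
    return ll, (lh, hl, hh)

def wavelet_loss(x, y, levels=2):
    # Multi-scale term: L1 distance between detail coefficients at each
    # level, plus the coarsest low-pass residual.
    total = 0.0
    for _ in range(levels):
        x, dx = haar_decompose(x)
        y, dy = haar_decompose(y)
        total += sum(np.mean(np.abs(a - b)) for a, b in zip(dx, dy))
    return total + np.mean(np.abs(x - y))

def regularized_loss(pred, target, lam_f=0.1, lam_w=0.1):
    # Standard pointwise objective plus the two soft spectral penalties;
    # only the training loss changes, not the diffusion process itself.
    mse = np.mean((pred - target) ** 2)
    return mse + lam_f * fourier_loss(pred, target) \
               + lam_w * wavelet_loss(pred, target)
```

In a real DDPM/EDM training loop, `pred` and `target` would be the predicted and true noise (or denoised signal) at a sampled timestep, and the same extra terms would simply be added to the existing loss — consistent with the paper's claim that no change to the architecture or sampler is needed.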