🤖 AI Summary
Diffusion models often deviate from the underlying data manifold under arbitrary guidance, degrading sample fidelity. To address this, we propose Temporal Alignment Guidance (TAG), a lightweight, plug-and-play mechanism that dynamically estimates the temporal deviation between the current sampling step and the target data manifold via an auxiliary time predictor. TAG then applies gradient-based correction to realign the sample trajectory at each step, enforcing fine-grained manifold constraints without modifying the diffusion model or requiring additional training. Empirically, TAG significantly improves generation quality across text-to-image synthesis and class-conditional generation: samples remain consistently closer to the true data manifold throughout sampling, exhibit enhanced robustness to guidance scale, achieve up to a 12.3% reduction in FID, and preserve both sampling efficiency and diversity.
📝 Abstract
Diffusion models have achieved remarkable success as generative models. However, even a well-trained model can accumulate errors throughout the generation process. These errors become particularly problematic when arbitrary guidance is applied to steer samples toward desired properties, as it often breaks sample fidelity. In this paper, we propose a general solution to the off-manifold phenomenon observed in diffusion models. Our approach leverages a time predictor to estimate deviations from the desired data manifold at each timestep, and we show that a larger time gap is associated with reduced generation quality. We then design a novel guidance mechanism, "Temporal Alignment Guidance" (TAG), which attracts samples back to the desired manifold at every timestep during generation. Through extensive experiments, we demonstrate that TAG consistently produces samples closely aligned with the desired manifold at each timestep, leading to significant improvements in generation quality across various downstream tasks.
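The core idea, a time predictor measuring how far a sample has drifted, plus a gradient step that shrinks that time gap, can be sketched in a toy form. This is a minimal illustration, not the paper's method: the RMS norm stands in for a learned time predictor, the gradient is estimated by finite differences rather than backpropagation, and all names (`tag_correction`, `toy_tau`, `eta`) are hypothetical.

```python
import math

def toy_tau(x):
    # Hypothetical stand-in for a learned time predictor tau_phi(x):
    # here the sample's RMS norm plays the role of the predicted timestep.
    return math.sqrt(sum(v * v for v in x) / len(x))

def tag_correction(x, t, time_predictor, eta=0.1, eps=1e-4):
    """One toy TAG-style step: gradient descent on the squared time gap
    (tau(x) - t)^2, with the gradient estimated by finite differences,
    pulling the sample toward the manifold associated with timestep t."""
    base = (time_predictor(x) - t) ** 2
    grad = []
    for i in range(len(x)):
        xp = list(x)
        xp[i] += eps
        grad.append(((time_predictor(xp) - t) ** 2 - base) / eps)
    return [xi - eta * gi for xi, gi in zip(x, grad)]

# A sample whose predicted time is too large drifts back toward t
# after one correction step, shrinking the time gap.
x = [3.0, 4.0]
x_new = tag_correction(x, t=1.0, time_predictor=toy_tau)
```

In the actual setting, the finite-difference loop would be replaced by automatic differentiation through the trained time predictor, and the correction would be added to each denoising update rather than applied in isolation.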