🤖 AI Summary
Conditional diffusion models are susceptible to noisy or weakly aligned conditioning signals—such as erroneous labels or ambiguous textual descriptions—leading to degraded generation quality. To address this, we propose a robust training paradigm that for the first time formulates conditional consistency as a learnable continuous latent variable embedded within the diffusion process, enabling the model to adaptively weight or suppress low-quality conditioning inputs. Our method extends the U-Net architecture to jointly encode both conditioning information and consistency scores, and introduces a theory-driven weighted loss function for end-to-end optimization. Evaluated across diverse conditional generation tasks, our approach achieves a 12.3% reduction in FID and a 9.7% improvement in CLIP Score, with markedly enhanced conditional fidelity. Generated samples exhibit superior realism, diversity, and strict adherence to high-fidelity constraints—without discarding any training samples.
📝 Abstract
Conditional diffusion models are powerful generative models that can leverage various types of conditional information, such as class labels, segmentation masks, or text captions. However, in many real-world scenarios, conditional infor-mation may be noisy or unreliable due to human annotation errors or weak alignment. In this paper, we propose the Coherence-Aware Diffusion (CAD), a novel method that in-tegrates coherence in conditional information into diffusion models, allowing them to learn from noisy annotations with-out discarding data. We assume that each data point has an associated coherence score that reflects the quality of the conditional information. We then condition the diffusion model on both the conditional information and the coherence score. In this way, the model learns to ignore or discount the conditioning when the coherence is low. We show that CAD is theoretically sound and empirically effective on various conditional generation tasks. Moreover, we show that lever-aging coherence generates realistic and diverse samples that respect conditional information better than models trained on cleaned datasets where samples with low coherence have been discarded. Code and weights here.