Training Consistency Models with Variational Noise Coupling

📅 2025-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high variance and instability of non-distillation consistency model training, this paper proposes a variational noise-coupling method grounded in the flow matching framework. Methodologically, it (1) introduces a trainable noise emission mechanism, removing reliance on a predefined forward diffusion process; (2) uses a data-driven encoder to learn the geometric mapping between the noise space and the data manifold; and (3) combines flow matching with a VAE-style architecture to enable data-dependent noise modeling. Empirically, the approach achieves the state-of-the-art (SoTA) FID among non-distillation consistency models on CIFAR-10 (2.41) and an FID on par with SoTA on ImageNet 64×64 (1.97) in two-step generation. Notably, the framework jointly learns the noise coupling and the consistency model end to end.
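The core coupling idea in the summary can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the linear "encoder" weights (`W_mu`, `W_logvar`) and function names are hypothetical stand-ins showing how a VAE-style encoder makes the noise endpoint of a linear flow-matching path depend on the data point, rather than being drawn independently as in classical consistency training.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W_mu, W_logvar):
    # Hypothetical linear "encoder": maps data x to the parameters
    # (mean, log-variance) of a data-dependent Gaussian over the noise z.
    return x @ W_mu, x @ W_logvar

def sample_coupled_noise(x, W_mu, W_logvar):
    # VAE reparameterization trick: z = mu(x) + sigma(x) * eps, so the
    # sampled noise is coupled to the data point x instead of independent.
    mu, logvar = encode(x, W_mu, W_logvar)
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def flow_interpolant(x, z, t):
    # Linear flow-matching path: data x at t=0, noise z at t=1.
    return (1.0 - t) * x + t * z

# Toy usage on a batch of 4 two-dimensional "data" points.
d = 2
x = rng.standard_normal((4, d))
W_mu = 0.1 * rng.standard_normal((d, d))       # assumed toy weights
W_logvar = 0.01 * rng.standard_normal((d, d))  # assumed toy weights

z = sample_coupled_noise(x, W_mu, W_logvar)
x_half = flow_interpolant(x, z, t=0.5)
print(x_half.shape)  # (4, 2)
```

In an actual training loop, a consistency loss would be computed on `flow_interpolant` samples at adjacent times, with gradients flowing into the encoder weights so the noise-to-data geometry is learned jointly with the model.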

📝 Abstract
Consistency Training (CT) has recently emerged as a promising alternative to diffusion models, achieving competitive performance in image generation tasks. However, non-distillation consistency training often suffers from high variance and instability, and analyzing and improving its training dynamics is an active area of research. In this work, we propose a novel CT training approach based on the Flow Matching framework. Our main contribution is a trained noise-coupling scheme inspired by the architecture of Variational Autoencoders (VAE). By training a data-dependent noise emission model implemented as an encoder architecture, our method can indirectly learn the geometry of the noise-to-data mapping, which is instead fixed by the choice of the forward process in classical CT. Empirical results across diverse image datasets show significant generative improvements, with our model outperforming baselines and achieving the state-of-the-art (SoTA) non-distillation CT FID on CIFAR-10, and attaining FID on par with SoTA on ImageNet at $64 \times 64$ resolution in 2-step generation. Our code is available at https://github.com/sony/vct
Problem

Research questions and friction points this paper is trying to address.

Improve consistency training stability
Enhance image generation performance
Develop noise-coupling scheme inspired by VAE
Innovation

Methods, ideas, or system contributions that make the work stand out.

Variational Noise Coupling
Flow Matching framework
Encoder-based noise emission