AI Summary
This work addresses the challenge in few-step diffusion generation where the Gaussian denoising assumption breaks down and existing methods struggle to simultaneously achieve high sample quality and accurate likelihood estimation. The authors propose a novel framework based on conditional normalizing flows, employing shallow invertible blocks and a cross-trajectory deep parallel predictor to model the full reverse trajectory while preserving exact likelihood during training. This approach enables, for the first time, full-trajectory likelihood optimization under few-step generation. A self-distillation mechanism is introduced, leveraging the model's own score signals to train a lightweight denoiser. The method supports training from scratch or initialization from a pretrained flow-matching model, and achieves state-of-the-art or competitive performance on text-to-image synthesis with only four sampling steps, maintaining both high visual quality and computable likelihood.
Abstract
Diffusion-based models decompose sampling into many small Gaussian denoising steps -- an assumption that breaks down when generation is compressed to a few coarse transitions. Existing few-step methods address this through distillation, consistency training, or adversarial objectives, but sacrifice the likelihood framework in the process. We introduce Normalizing Trajectory Models (NTM), which models each reverse step as an expressive conditional normalizing flow with exact likelihood training. Architecturally, NTM combines shallow invertible blocks within each step with a deep parallel predictor across the trajectory, forming an end-to-end network trainable from scratch or initializable from pretrained flow-matching models. Its exact trajectory likelihood further enables self-distillation: a lightweight denoiser trained on the model's own score produces high-quality samples in four steps. On text-to-image benchmarks, NTM matches or outperforms strong image generation baselines in just four sampling steps while uniquely retaining exact likelihood over the generative trajectory.
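To make the idea of an exact trajectory likelihood concrete, here is a toy sketch in plain Python. It is not the authors' architecture. It assumes a 2-D state, a standard normal prior over the starting noise, a Markov factorization of the reverse trajectory, and two hand-rolled affine coupling layers per step in place of the paper's invertible blocks and parallel predictor. All function names and the weight layout are hypothetical.

```python
import math
import random


def std_normal_logpdf(v):
    """Log-density of a 2-D standard normal at point v."""
    return -0.5 * (v[0] ** 2 + v[1] ** 2) - math.log(2.0 * math.pi)


def _scale_shift(a, cond, w):
    """Toy conditioner (assumption): maps the untouched coordinate `a` and the
    conditioning state `cond` to a (log_scale, shift) pair."""
    h = math.tanh(w[0] * a + w[1] * cond[0] + w[2] * cond[1])
    return w[3] * h, w[4] * h


def step_forward(z, cond, layers):
    """Push base noise z through the coupling layers; return (x, log|det J|)."""
    x, log_det = z, 0.0
    for w in layers:
        s, t = _scale_shift(x[0], cond, w)
        x = (x[0], x[1] * math.exp(s) + t)  # affine coupling on 2nd coordinate
        x = (x[1], x[0])                    # swap so both coordinates get updated
        log_det += s
    return x, log_det


def step_log_prob(x, cond, layers):
    """Exact log p(x | cond) by inverting the flow (change of variables)."""
    log_det = 0.0
    for w in reversed(layers):
        x = (x[1], x[0])                    # undo the swap
        s, t = _scale_shift(x[0], cond, w)
        x = (x[0], (x[1] - t) * math.exp(-s))
        log_det += s
    return std_normal_logpdf(x) - log_det


def sample_trajectory(step_layers, rng):
    """Draw starting noise, then apply one conditional flow per reverse step."""
    x = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
    traj = [x]
    for layers in step_layers:
        z = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        x, _ = step_forward(z, traj[-1], layers)
        traj.append(x)
    return traj


def trajectory_log_prob(traj, step_layers):
    """Prior term plus one exact conditional term per reverse step."""
    total = std_normal_logpdf(traj[0])
    for prev, nxt, layers in zip(traj, traj[1:], step_layers):
        total += step_log_prob(nxt, prev, layers)
    return total


def make_toy_weights(num_steps, rng):
    """Random stand-in parameters: 2 coupling layers of 5 weights per step."""
    return [[[rng.uniform(-1.0, 1.0) for _ in range(5)] for _ in range(2)]
            for _ in range(num_steps)]


if __name__ == "__main__":
    rng = random.Random(0)
    weights = make_toy_weights(4, rng)  # four reverse steps, as in the abstract
    traj = sample_trajectory(weights, rng)
    print(trajectory_log_prob(traj, weights))
```

Because the log-scale depends nonlinearly on the input, each conditional step is non-Gaussian in general, yet its density stays exactly computable. Training would maximize `trajectory_log_prob` over the weights; that optimization is omitted here.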