SimFlow: Simplified and End-to-End Training of Latent Normalizing Flows

📅 2025-12-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing NF-VAE hybrid models suffer from two key limitations: (1) reliance on a pre-trained, fixed encoder—degrading reconstruction and generation quality; and (2) the need for auxiliary noise-corruption and denoising procedures—increasing training complexity. This work proposes a simplification strategy: fixing the VAE latent variance to a constant (e.g., 0.5), thereby explicitly simplifying the evidence lower bound (ELBO) and enabling stable, end-to-end joint optimization of NF and VAE. The approach eliminates both pre-trained encoders and intricate data augmentation schemes, substantially reducing training difficulty while improving generalization. On ImageNet 256×256, it achieves a gFID of 2.15; integrating REPA-E further lowers it to 1.91—the new state-of-the-art for flow-based image generation. To our knowledge, this is the first work to realize an efficient, conceptually simple, and high-performance end-to-end NF-VAE training paradigm.
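The ELBO simplification can be made concrete with the standard Gaussian KL term (a sketch; the paper's exact objective may differ in weighting or constants). For a posterior $q(z\mid x)=\mathcal{N}(\mu,\sigma^2 I)$ and prior $\mathcal{N}(0,I)$:

$$\mathrm{KL}\big(\mathcal{N}(\mu,\sigma^2 I)\,\|\,\mathcal{N}(0,I)\big)=\frac{1}{2}\sum_{i=1}^{d}\left(\mu_i^2+\sigma_i^2-1-\log\sigma_i^2\right)$$

Fixing the variance at a constant $\sigma_i^2\equiv c$ (e.g., $c=0.5$) collapses this to

$$\mathrm{KL}=\frac{1}{2}\lVert\mu\rVert^2+\frac{d}{2}\left(c-1-\log c\right),$$

where the second term is constant with respect to the parameters. The KL regularizer thus reduces to a simple L2 penalty on the predicted means, which is what makes joint NF-VAE optimization tractable.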

📝 Abstract
Normalizing Flows (NFs) learn invertible mappings between the data and a Gaussian distribution. Prior works usually suffer from two limitations. First, they add random noise to training samples or VAE latents as data augmentation, introducing complex pipelines including extra noising and denoising steps. Second, they use a pretrained and frozen VAE encoder, resulting in suboptimal reconstruction and generation quality. In this paper, we find that the two issues can be solved in a very simple way: just fixing the variance (which would otherwise be predicted by the VAE encoder) to a constant (e.g., 0.5). On the one hand, this method allows the encoder to output a broader distribution of tokens and the decoder to learn to reconstruct clean images from the augmented token distribution, avoiding additional noise or denoising design. On the other hand, fixed variance simplifies the VAE evidence lower bound, making it stable to train an NF with a VAE jointly. On the ImageNet $256 \times 256$ generation task, our model SimFlow obtains a gFID score of 2.15, outperforming the state-of-the-art method STARFlow (gFID 2.40). Moreover, SimFlow can be seamlessly integrated with the end-to-end representation alignment (REPA-E) method and achieves an improved gFID of 1.91, setting a new state of the art among NFs.
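The fixed-variance trick described in the abstract can be sketched in a few lines (a hypothetical minimal illustration, not the paper's implementation; `encode` stands in for the learned VAE encoder, which here predicts only the mean):

```python
import math
import random

FIXED_VAR = 0.5                   # constant latent variance, as in the paper
FIXED_STD = math.sqrt(FIXED_VAR)  # sigma = sqrt(0.5)

def encode(x, weight=0.9):
    # Hypothetical stand-in for the VAE encoder: with fixed variance,
    # the encoder network only needs to predict the latent mean.
    return [weight * xi for xi in x]

def reparameterize(mu, rng=random):
    # z = mu + sigma * eps, with sigma held constant instead of predicted.
    # The constant noise broadens the token distribution seen by the
    # decoder, acting as built-in augmentation and removing the need
    # for separate noising/denoising stages in the pipeline.
    return [m + FIXED_STD * rng.gauss(0.0, 1.0) for m in mu]
```

The NF would then be trained jointly on the sampled tokens `z`, since the fixed variance keeps the ELBO's KL term well behaved throughout training.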
Problem

Research questions and friction points this paper is trying to address.

Pretrained, frozen VAE encoders limit reconstruction and generation quality
Auxiliary noising and denoising steps make training pipelines complex
Jointly training an NF with a VAE end-to-end is unstable under the standard ELBO
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fixing the latent variance to a constant simplifies the ELBO and stabilizes joint NF-VAE training
Constant-variance noise serves as built-in latent augmentation, eliminating separate noising and denoising steps
End-to-end training achieves gFID 2.15 on ImageNet 256×256, improving to 1.91 with REPA-E