🤖 AI Summary
To address the reliance on pretrained VAEs, as well as the computational overhead and architectural complexity introduced by latent-space mapping in image generation, this paper proposes PixelFlow, the first end-to-end trainable, purely pixel-space flow-based generative model. PixelFlow eliminates VAE encoders and decoders entirely, avoiding any latent-space projection, and instead learns a flow-based architecture that operates directly on pixels. It employs an efficient cascade flow design to make high-resolution modeling (256×256) tractable. On class-conditional ImageNet generation, PixelFlow achieves an FID of 1.98, substantially outperforming prior pixel-space methods. Moreover, in text-to-image synthesis, it demonstrates strong detail fidelity, semantic controllability, and artistic expressiveness.
📄 Abstract
We present PixelFlow, a family of image generation models that operate directly in raw pixel space, in contrast to the predominant latent-space models. This approach simplifies the image generation process by eliminating the need for a pre-trained Variational Autoencoder (VAE) and making the whole model end-to-end trainable. Through efficient cascade flow modeling, PixelFlow achieves affordable computation cost in pixel space. It achieves an FID of 1.98 on the 256×256 ImageNet class-conditional image generation benchmark. The qualitative text-to-image results demonstrate that PixelFlow excels in image quality, artistry, and semantic control. We hope this new paradigm will inspire and open up new opportunities for next-generation visual generation models. Code and models are available at https://github.com/ShoufaChen/PixelFlow.
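To make the cascade idea concrete, here is a minimal toy sketch of multi-stage pixel-space flow sampling: integrate a velocity field from noise toward a sample at a low resolution, then upsample and re-noise before refining at the next resolution. Everything here is illustrative and hypothetical (`toy_velocity`, the Euler step count, and the 0.7/0.3 noise-mixing schedule are stand-ins, not the paper's actual network, solver, or schedule); it only shows the shape of a cascade flow pipeline, not PixelFlow itself.

```python
import numpy as np

def toy_velocity(x, t):
    # Hypothetical stand-in for a learned velocity network; a real model
    # would predict the flow that transports noise toward an image.
    return -x * (1.0 - t)

def euler_flow(x, steps=8):
    # Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (sample)
    # with a simple fixed-step Euler solver.
    dt = 1.0 / steps
    t = 0.0
    for _ in range(steps):
        x = x + dt * toy_velocity(x, t)
        t += dt
    return x

def cascade_sample(resolutions=(64, 128, 256), channels=3, seed=0):
    rng = np.random.default_rng(seed)
    # Stage 1 starts from pure Gaussian noise at the lowest resolution.
    x = rng.standard_normal((resolutions[0], resolutions[0], channels))
    for i, res in enumerate(resolutions):
        x = euler_flow(x)
        if i + 1 < len(resolutions):
            # Upsample (nearest-neighbor) and re-noise for the next stage;
            # this mixing schedule is illustrative, not the paper's.
            scale = resolutions[i + 1] // res
            x = np.repeat(np.repeat(x, scale, axis=0), scale, axis=1)
            x = 0.7 * x + 0.3 * rng.standard_normal(x.shape)
    return x

sample = cascade_sample()
print(sample.shape)  # (256, 256, 3)
```

The point of the cascade is cost: most solver steps run at low resolution, and only the final refinement stage pays full 256×256 pixel-space compute.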