🤖 AI Summary
This work addresses the limitations of conventional diffusion and flow-based models, which typically rely on multi-step sampling and latent-space representations. The authors propose pixel MeanFlow (pMF), a novel approach that, for the first time, achieves high-quality single-step image generation without requiring a latent space. By decoupling the network output from the loss space and predicting directly on the image manifold, pMF introduces a MeanFlow velocity-field loss together with a mapping from the image manifold to the velocity field. Combined with an x-prediction objective, this design significantly enhances both generation efficiency and fidelity. The method sets new state-of-the-art results for single-step generation, achieving FID scores of 2.22 and 2.48 on ImageNet at 256×256 and 512×512 resolutions, respectively.
📝 Abstract
Modern diffusion/flow-based models for image generation typically exhibit two core characteristics: (i) using multi-step sampling, and (ii) operating in a latent space. Recent advances have made encouraging progress on each aspect individually, paving the way toward one-step diffusion/flow without latents. In this work, we take a further step towards this goal and propose "pixel MeanFlow" (pMF). Our core guideline is to formulate the network output space and the loss space separately. The network target is designed to be on a presumed low-dimensional image manifold (i.e., x-prediction), while the loss is defined via MeanFlow in the velocity space. We introduce a simple transformation between the image manifold and the average velocity field. In experiments, pMF achieves strong results for one-step latent-free generation on ImageNet at 256×256 resolution (2.22 FID) and 512×512 resolution (2.48 FID), filling a key missing piece in this regime. We hope that our study will further advance the boundaries of diffusion/flow-based generative models.
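The abstract does not spell out the transformation between the x-prediction and the average velocity field. As an illustration only, under the standard linear flow-matching interpolant z_t = (1 − t)·x + t·ε, a network that outputs a clean-image estimate x̂ implies the straight-line average velocity u = (z_t − x̂)/t, which can then be plugged into a MeanFlow-style one-step update z_r = z_t − (t − r)·u. The sketch below assumes this interpolant and uses hypothetical helper names; the paper's actual parameterization may differ.

```python
import numpy as np

def xhat_to_mean_velocity(z_t, x_hat, t):
    """Map an x-prediction to an implied average velocity.

    Assumes the linear interpolant z_t = (1 - t) * x + t * eps (an
    illustrative choice, not necessarily the paper's exact one). The
    straight path through x_hat gives u = (z_t - x_hat) / t, which is
    independent of the lower time limit r because the path is linear.
    """
    return (z_t - x_hat) / t

def one_step_sample(x_hat_fn, shape, rng):
    """One-step generation: a single MeanFlow-style update from t=1 to r=0.

    x_hat_fn stands in for the trained x-prediction network.
    """
    z1 = rng.standard_normal(shape)                 # pure noise at t = 1
    u = xhat_to_mean_velocity(z1, x_hat_fn(z1), t=1.0)
    return z1 - (1.0 - 0.0) * u                     # z_0 = z_1 - (t - r) * u
```

A quick sanity check of this mapping: with a perfect x-prediction, u recovers the instantaneous straight-line velocity ε − x at any t, and the one-step update returns the predicted image exactly.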