Effective and Efficient Masked Image Generation Models

📅 2025-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work unifies two distinct paradigms—masked image generation (MIG) and masked diffusion (MD)—to develop eMIGM, an efficient and high-fidelity generative model. Methodologically, it introduces a unified framework integrating masked modeling, discrete diffusion, a parameter-sharing architecture, and FID-driven training optimization, alongside a sampling scheduling strategy that balances performance and computational efficiency. Key contributions include: (1) establishing general design principles for masked generative models; (2) surpassing the seminal VAR on ImageNet 256×256 at a similar number of function evaluations (NFE) and model parameters; and (3) matching state-of-the-art continuous diffusion models on ImageNet 256×256 with less than 40% of their NFE, and outperforming them on ImageNet 512×512 with only about 60% of the NFE. The framework thus advances generative modeling by bridging discrete masked and diffusion-based approaches without sacrificing quality or scalability.
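The iterative unmasking loop at the core of masked image generation models of this kind can be sketched as follows. This is a hedged illustration in the style of MaskGIT-like confidence-based samplers, not the paper's exact algorithm: `predict_logits`, the cosine masking schedule, the vocabulary size, and all constants are assumptions for illustration only.

```python
import numpy as np

MASK = -1  # sentinel value for a masked token position (illustrative choice)

def masked_generation_sample(predict_logits, num_tokens=16, steps=4, seed=0):
    """Iterative masked-token sampling sketch: predict all masked positions,
    keep the most confident predictions, re-mask the rest, and repeat."""
    rng = np.random.default_rng(seed)
    tokens = np.full(num_tokens, MASK)  # start from a fully masked canvas
    for step in range(steps):
        # cosine schedule: fraction of positions left masked after this step
        frac = np.cos((step + 1) / steps * np.pi / 2)
        n_mask = int(np.floor(frac * num_tokens))
        logits = predict_logits(tokens, rng)           # (num_tokens, vocab)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        pred, conf = probs.argmax(-1), probs.max(-1)
        conf[tokens != MASK] = np.inf                  # decoded tokens stay fixed
        tokens = np.where(tokens == MASK, pred, tokens)
        tokens[np.argsort(conf)[:n_mask]] = MASK       # re-mask least confident
    return tokens

def toy_model(tokens, rng):
    """Stand-in for a trained network: random logits over an 8-token vocab."""
    return rng.normal(size=(tokens.shape[0], 8))

sample = masked_generation_sample(toy_model)
```

After the final step the schedule reaches zero, so every position holds a decoded token; fewer steps trade sample quality for fewer function evaluations, which is the NFE/quality trade-off the paper's sampling schedule is designed around.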

📝 Abstract
Although masked image generation models and masked diffusion models are designed with different motivations and objectives, we observe that they can be unified within a single framework. Building upon this insight, we carefully explore the design space of training and sampling, identifying key factors that contribute to both performance and efficiency. Based on the improvements observed during this exploration, we develop our model, referred to as eMIGM. Empirically, eMIGM demonstrates strong performance on ImageNet generation, as measured by Fréchet Inception Distance (FID). In particular, on ImageNet 256×256, with a similar number of function evaluations (NFEs) and model parameters, eMIGM outperforms the seminal VAR. Moreover, as NFE and model parameters increase, eMIGM achieves performance comparable to the state-of-the-art continuous diffusion models while requiring less than 40% of the NFE. Additionally, on ImageNet 512×512, with only about 60% of the NFE, eMIGM outperforms the state-of-the-art continuous diffusion models.
Problem

Research questions and friction points this paper is trying to address.

Unifying masked image generation and diffusion models
Optimizing training and sampling for performance and efficiency
Improving image generation quality with fewer computational resources
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework for masked image generation
Optimized training and sampling design
High efficiency with fewer function evaluations
Zebin You
Renmin University of China
generative model · diffusion model · semi-supervised learning · self-supervised learning
Jingyang Ou
Gaoling School of AI, Renmin University of China; Beijing Key Laboratory of Big Data Management and Analysis Methods, Beijing, China
Xiaolu Zhang
Ant Group
Jun Hu
Ant Group
Jun Zhou
Ant Group
Chongxuan Li
Associate Professor, Renmin University of China
Machine Learning · Generative Models · Deep Learning