🤖 AI Summary
This work unifies two distinct paradigms—masked image generation models and masked diffusion models—within a single framework, and uses that framework to develop eMIGM, an efficient and high-fidelity generative model. Methodologically, the authors systematically explore the design space of training and sampling, identifying the factors that drive both sample quality and computational efficiency, including a sampling schedule that trades off the two. Key contributions include: (1) a unified view of masked generative modeling that yields general design principles; (2) outperforming the seminal VAR on ImageNet 256×256 at a similar number of function evaluations (NFE) and model parameters; and (3) matching state-of-the-art continuous diffusion models on ImageNet 256×256 with less than 40% of their NFE, and outperforming them on ImageNet 512×512 with only about 60% of the NFE. The framework thus bridges discrete masked modeling and diffusion-based generation without sacrificing quality or scalability.
📝 Abstract
Although masked image generation models and masked diffusion models are designed with different motivations and objectives, we observe that they can be unified within a single framework. Building upon this insight, we carefully explore the design space of training and sampling, identifying key factors that contribute to both performance and efficiency. Based on the improvements observed during this exploration, we develop our model, referred to as eMIGM. Empirically, eMIGM demonstrates strong performance on ImageNet generation, as measured by Fréchet Inception Distance (FID). In particular, on ImageNet 256×256, with a similar number of function evaluations (NFEs) and model parameters, eMIGM outperforms the seminal VAR. Moreover, as NFE and model parameters increase, eMIGM achieves performance comparable to the state-of-the-art continuous diffusion models while requiring less than 40% of the NFE. Additionally, on ImageNet 512×512, with only about 60% of the NFE, eMIGM outperforms the state-of-the-art continuous diffusion models.
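To make the masked-generation paradigm concrete, the sketch below shows a generic iterative unmasking sampler of the kind both masked image generation and masked diffusion models share: start from a fully masked token sequence, and at each step let the model predict all masked positions, committing the most confident predictions according to a cosine unmasking schedule. This is an illustrative toy, not the paper's eMIGM sampler; `masked_generation_sample` and `toy_predictor` are hypothetical names, and the toy predictor stands in for a trained masked transformer.

```python
import math
import random

def masked_generation_sample(seq_len, num_steps, predict_fn):
    """Iteratively unmask a token sequence over `num_steps` rounds.

    `predict_fn(tokens)` returns one (token, confidence) pair per
    position; at each step, the most confident masked positions are
    committed, keeping a cosine-scheduled fraction masked.
    """
    MASK = -1
    tokens = [MASK] * seq_len
    for step in range(num_steps):
        # Cosine schedule: fraction of tokens still masked after this step
        # (reaches 0 at the final step, so every position gets filled).
        frac = math.cos(math.pi / 2 * (step + 1) / num_steps)
        keep_masked = int(frac * seq_len)
        preds = predict_fn(tokens)
        # Rank the still-masked positions by model confidence.
        masked = [i for i in range(seq_len) if tokens[i] == MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        # Commit the most confident predictions; leave `keep_masked` masked.
        for i in masked[: max(0, len(masked) - keep_masked)]:
            tokens[i] = preds[i][0]
    return tokens

def toy_predictor(tokens, vocab_size=16, rng=random.Random(0)):
    # Stand-in for a trained masked model: random token + confidence.
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]

sample = masked_generation_sample(seq_len=64, num_steps=8,
                                  predict_fn=toy_predictor)
```

Note that `num_steps` here plays the role of the NFE budget discussed in the abstract: fewer steps mean fewer forward passes, at the cost of committing more tokens per step.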