🤖 AI Summary
This work unifies two distinct paradigms—masked image generation models and masked diffusion models—within a single framework, and uses that framework to develop eMIGM, an efficient and high-fidelity generative model. Methodologically, the authors systematically explore the design space of training and sampling, identifying the factors that drive both sample quality and computational efficiency, including a sampling schedule that trades off the two. Key contributions include: (1) a unified view of masked generative modeling that yields general design principles; (2) outperforming the seminal VAR on ImageNet 256×256 at a similar number of function evaluations (NFE) and model parameters; and (3) matching state-of-the-art continuous diffusion models on ImageNet 256×256 with less than 40% of their NFE, and outperforming them on ImageNet 512×512 with only about 60% of the NFE. The framework thus bridges discrete masked modeling and diffusion-based generation without sacrificing quality or scalability.
📝 Abstract
Although masked image generation models and masked diffusion models are designed with different motivations and objectives, we observe that they can be unified within a single framework. Building upon this insight, we carefully explore the design space of training and sampling, identifying key factors that contribute to both performance and efficiency. Based on the improvements observed during this exploration, we develop our model, referred to as eMIGM. Empirically, eMIGM demonstrates strong performance on ImageNet generation, as measured by Fréchet Inception Distance (FID). In particular, on ImageNet 256×256, with a similar number of function evaluations (NFEs) and model parameters, eMIGM outperforms the seminal VAR. Moreover, as NFE and model parameters increase, eMIGM achieves performance comparable to the state-of-the-art continuous diffusion models while requiring less than 40% of the NFE. Additionally, on ImageNet 512×512, with only about 60% of the NFE, eMIGM outperforms the state-of-the-art continuous diffusion models.
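To make the masked-generation paradigm concrete, the sketch below shows a generic iterative unmasking sampler of the kind both masked image generation and masked diffusion models share: start from a fully masked token sequence, and at each step let the model predict all masked positions, committing the most confident predictions according to a cosine unmasking schedule. This is an illustrative toy, not the paper's eMIGM sampler; `masked_generation_sample` and `toy_predictor` are hypothetical names, and the toy predictor stands in for a trained masked transformer.

```python
import math
import random

def masked_generation_sample(seq_len, num_steps, predict_fn):
    """Iteratively unmask a token sequence over `num_steps` rounds.

    `predict_fn(tokens)` returns one (token, confidence) pair per
    position; at each step, the most confident masked positions are
    committed, keeping a cosine-scheduled fraction masked.
    """
    MASK = -1
    tokens = [MASK] * seq_len
    for step in range(num_steps):
        # Cosine schedule: fraction of tokens still masked after this step
        # (reaches 0 at the final step, so every position gets filled).
        frac = math.cos(math.pi / 2 * (step + 1) / num_steps)
        keep_masked = int(frac * seq_len)
        preds = predict_fn(tokens)
        # Rank the still-masked positions by model confidence.
        masked = [i for i in range(seq_len) if tokens[i] == MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        # Commit the most confident predictions; leave `keep_masked` masked.
        for i in masked[: max(0, len(masked) - keep_masked)]:
            tokens[i] = preds[i][0]
    return tokens

def toy_predictor(tokens, vocab_size=16, rng=random.Random(0)):
    # Stand-in for a trained masked model: random token + confidence.
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]

sample = masked_generation_sample(seq_len=64, num_steps=8,
                                  predict_fn=toy_predictor)
```

Note that `num_steps` here plays the role of the NFE budget discussed in the abstract: fewer steps mean fewer forward passes, at the cost of committing more tokens per step.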