Lumina-mGPT 2.0: Stand-Alone AutoRegressive Image Modeling

📅 2025-07-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work introduces the first decoder-only autoregressive foundation model for image generation trained fully from scratch, eliminating reliance on pretrained components (e.g., VAEs, CLIP) and hybrid architectures (e.g., diffusion + autoregression). Methodologically, it employs unified image tokenization, inference-time scaling, and speculative Jacobi sampling to achieve high-fidelity image synthesis and joint multimodal task modeling. Key contributions are: (1) the first native support, within a single autoregressive framework, for text-to-image generation, image editing, controllable synthesis, and dense prediction; (2) competitive or superior performance against DALL-E 3 and SANA on generation benchmarks including GenEval and DPG; and (3) strong generalization across diverse vision-language tasks on the Graph200K multimodal benchmark, empirically validating the approach as a viable unified multimodal generative foundation model.

📝 Abstract
We present Lumina-mGPT 2.0, a stand-alone, decoder-only autoregressive model that revisits and revitalizes the autoregressive paradigm for high-quality image generation and beyond. Unlike existing approaches that rely on pretrained components or hybrid architectures, Lumina-mGPT 2.0 is trained entirely from scratch, enabling unrestricted architectural design and licensing freedom. It achieves generation quality on par with state-of-the-art diffusion models such as DALL-E 3 and SANA, while preserving the inherent flexibility and compositionality of autoregressive modeling. Our unified tokenization scheme allows the model to seamlessly handle a wide spectrum of tasks (including subject-driven generation, image editing, controllable synthesis, and dense prediction) within a single generative framework. To further boost usability, we incorporate efficient decoding strategies like inference-time scaling and speculative Jacobi sampling to improve quality and speed, respectively. Extensive evaluations on standard text-to-image benchmarks (e.g., GenEval, DPG) demonstrate that Lumina-mGPT 2.0 not only matches but in some cases surpasses diffusion-based models. Moreover, we confirm its multi-task capabilities on the Graph200K benchmark, with the native Lumina-mGPT 2.0 performing exceptionally well. These results position Lumina-mGPT 2.0 as a strong, flexible foundation model for unified multimodal generation. We have released our training details, code, and models at https://github.com/Alpha-VLLM/Lumina-mGPT-2.0.
Problem

Research questions and friction points this paper is trying to address.

Develops a standalone autoregressive model for high-quality image generation
Unifies diverse tasks under a single generative framework
Matches or surpasses diffusion models in performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stand-alone autoregressive model for image generation
Unified tokenization for multi-task handling
Efficient decoding strategies for quality and speed
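The "efficient decoding strategies" above include speculative Jacobi sampling. As a rough illustration of the underlying idea, the sketch below shows plain (greedy) Jacobi fixed-point decoding: a block of draft tokens is re-predicted in parallel each iteration until it stops changing, at which point it matches ordinary one-token-at-a-time greedy decoding. This is a toy under stated assumptions, not the paper's implementation; the `model` callable, `jacobi_decode`, and `sequential_decode` names are all illustrative, and the paper's speculative variant additionally uses probabilistic acceptance to preserve stochastic sampling.

```python
# Toy sketch of Jacobi (fixed-point) parallel decoding -- the deterministic
# core that speculative Jacobi sampling builds on. `model` is a stand-in for
# greedy decoding: any deterministic map from a token prefix to the next token.

def jacobi_decode(model, prompt, n_new):
    """Decode n_new tokens by re-predicting all positions in parallel until
    the draft reaches the fixed point of greedy autoregressive decoding."""
    draft = [0] * n_new              # arbitrary initial guesses
    iteration = 0
    for iteration in range(1, n_new + 1):
        # One Jacobi step: every position is re-predicted from the *previous*
        # iteration's draft (these calls are parallelizable in a real model).
        new_draft = [model(prompt + draft[:i]) for i in range(n_new)]
        if new_draft == draft:       # fixed point reached early
            break
        draft = new_draft
    return draft, iteration

def sequential_decode(model, prompt, n_new):
    """Reference: ordinary one-token-at-a-time greedy decoding."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq[len(prompt):]
```

By induction, after iteration k the first k draft positions already agree with sequential decoding, so the loop needs at most `n_new` iterations and often far fewer, which is where the speed-up comes from: each iteration is one parallel forward pass instead of one token.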
Yi Xin
California Institute of Technology
Industrial Organization, Econometrics
Juncheng Yan
Shanghai AI Laboratory
Qi Qin
Shanghai AI Laboratory
Zhen Li
The Chinese University of Hong Kong
Dongyang Liu
MMLab CUHK
Image/Video Generation, LLMs, VLMs
Shicheng Li
Shanghai AI Laboratory
Victor Shea-Jay Huang
The Chinese University of Hong Kong
Yupeng Zhou
Shanghai AI Laboratory
Renrui Zhang
Seed ByteDance & MMLab & PKU
Large Multimodal Models, Generative Models, Embodied AI
Le Zhuo
Krea AI
generative models, multi-modal learning
Tiancheng Han
School of Physical Science and Technology, Southwest University, Chongqing 400715, China
Transformation optics, Metamaterials, Heat manipulation, Super-resolution imaging
Xiaoqing Sun
Shanghai Innovation Institute
Siqi Luo
Shanghai Jiao Tong University
AIGC, Computer Vision, Image Editing, AI4Science
Mengmeng Wang
Zhejiang University of Technology
Bin Fu
Shanghai AI Laboratory
Yuewen Cao
The Chinese University of Hong Kong
Hongsheng Li
The Chinese University of Hong Kong
Guangtao Zhai
Professor, IEEE Fellow, Shanghai Jiao Tong University
Multimedia Signal Processing, Visual Quality Assessment, QoE, AI Evaluation, Displays
Xiaohong Liu
Shanghai Jiao Tong University
Yu Qiao
Shanghai AI Laboratory
Peng Gao
Shanghai AI Laboratory