Lumina-mGPT 2.0: Stand-Alone AutoRegressive Image Modeling

📅 2025-07-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work introduces the first decoder-only autoregressive foundation model for image generation trained fully from scratch, eliminating reliance on pretrained components (e.g., VAEs, CLIP) and hybrid architectures (e.g., diffusion + autoregression). Methodologically, it employs unified image tokenization, inference-time scaling, and speculative Jacobi sampling to achieve high-fidelity image synthesis and joint multimodal task modeling. Key contributions are: (1) the first native support, within a single autoregressive framework, for text-to-image generation, image editing, controllable synthesis, and dense prediction; (2) competitive or superior performance against DALL-E 3 and SANA on generation benchmarks including GenEval and DPG; and (3) strong generalization across diverse vision-language tasks on the Graph200K multimodal benchmark, empirically validating the approach as a viable unified multimodal generative foundation model.

📝 Abstract
We present Lumina-mGPT 2.0, a stand-alone, decoder-only autoregressive model that revisits and revitalizes the autoregressive paradigm for high-quality image generation and beyond. Unlike existing approaches that rely on pretrained components or hybrid architectures, Lumina-mGPT 2.0 is trained entirely from scratch, enabling unrestricted architectural design and licensing freedom. It achieves generation quality on par with state-of-the-art diffusion models such as DALL-E 3 and SANA, while preserving the inherent flexibility and compositionality of autoregressive modeling. Our unified tokenization scheme allows the model to seamlessly handle a wide spectrum of tasks (including subject-driven generation, image editing, controllable synthesis, and dense prediction) within a single generative framework. To further boost usability, we incorporate efficient decoding strategies like inference-time scaling and speculative Jacobi sampling to improve quality and speed, respectively. Extensive evaluations on standard text-to-image benchmarks (e.g., GenEval, DPG) demonstrate that Lumina-mGPT 2.0 not only matches but in some cases surpasses diffusion-based models. Moreover, we confirm its multi-task capabilities on the Graph200K benchmark, with the native Lumina-mGPT 2.0 performing exceptionally well. These results position Lumina-mGPT 2.0 as a strong, flexible foundation model for unified multimodal generation. We have released our training details, code, and models at https://github.com/Alpha-VLLM/Lumina-mGPT-2.0.
Problem

Research questions and friction points this paper is trying to address.

Develops a standalone autoregressive model for high-quality image generation
Unifies diverse tasks under a single generative framework
Matches or surpasses diffusion models in performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stand-alone autoregressive model for image generation
Unified tokenization for multi-task handling
Efficient decoding strategies for quality and speed
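The "efficient decoding strategies" above include speculative Jacobi sampling. As a rough illustration of the underlying idea, the sketch below shows plain (greedy) Jacobi fixed-point decoding: a block of draft tokens is re-predicted in parallel each iteration until it stops changing, at which point it matches ordinary one-token-at-a-time greedy decoding. This is a toy under stated assumptions, not the paper's implementation; the `model` callable, `jacobi_decode`, and `sequential_decode` names are all illustrative, and the paper's speculative variant additionally uses probabilistic acceptance to preserve stochastic sampling.

```python
# Toy sketch of Jacobi (fixed-point) parallel decoding -- the deterministic
# core that speculative Jacobi sampling builds on. `model` is a stand-in for
# greedy decoding: any deterministic map from a token prefix to the next token.

def jacobi_decode(model, prompt, n_new):
    """Decode n_new tokens by re-predicting all positions in parallel until
    the draft reaches the fixed point of greedy autoregressive decoding."""
    draft = [0] * n_new              # arbitrary initial guesses
    iteration = 0
    for iteration in range(1, n_new + 1):
        # One Jacobi step: every position is re-predicted from the *previous*
        # iteration's draft (these calls are parallelizable in a real model).
        new_draft = [model(prompt + draft[:i]) for i in range(n_new)]
        if new_draft == draft:       # fixed point reached early
            break
        draft = new_draft
    return draft, iteration

def sequential_decode(model, prompt, n_new):
    """Reference: ordinary one-token-at-a-time greedy decoding."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq[len(prompt):]
```

By induction, after iteration k the first k draft positions already agree with sequential decoding, so the loop needs at most `n_new` iterations and often far fewer, which is where the speed-up comes from: each iteration is one parallel forward pass instead of one token.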
Yi Xin
California Institute of Technology
Industrial Organization, Econometrics
Juncheng Yan
Shanghai AI Laboratory
Qi Qin
Shanghai AI Laboratory
Zhen Li
The Chinese University of Hong Kong
Dongyang Liu
MMLab CUHK
Image/Video Generation, LLMs, VLMs
Shicheng Li
Shanghai AI Laboratory
Victor Shea-Jay Huang
The Chinese University of Hong Kong
Yupeng Zhou
Shanghai AI Laboratory
Renrui Zhang
Seed ByteDance & MMLab & PKU
Large Multimodal Models, Generative Models, Embodied AI
Le Zhuo
Krea AI
generative models, multi-modal learning
Tiancheng Han
School of Physical Science and Technology, Southwest University, Chongqing 400715, China
Transformation optics, Metamaterials, Heat manipulation, Super-resolution imaging
Xiaoqing Sun
Shanghai Innovation Institute
Siqi Luo
Shanghai Jiao Tong University
AIGC, Computer Vision, Image Editing, AI4Science
Mengmeng Wang
Zhejiang University of Technology
Bin Fu
Shanghai AI Laboratory
Yuewen Cao
The Chinese University of Hong Kong
Hongsheng Li
The Chinese University of Hong Kong
Guangtao Zhai
Professor, IEEE Fellow, Shanghai Jiao Tong University
Multimedia Signal Processing, Visual Quality Assessment, QoE, AI Evaluation, Displays
Xiaohong Liu
Shanghai Jiao Tong University
Yu Qiao
Shanghai AI Laboratory
Peng Gao
Shanghai AI Laboratory