Lumina-Image 2.0: A Unified and Efficient Image Generative Framework

📅 2025-03-27
📈 Citations: 1 · Influential: 0
🤖 AI Summary
This work addresses key challenges in text-to-image generation, namely inconsistent cross-modal modeling, poor task scalability, low inference efficiency, and limited generation quality, by proposing a synergistic framework that pairs the Unified Next-DiT architecture with the UniCap high-fidelity image captioning system. Methodologically, it (1) jointly models text and image token sequences for end-to-end cross-modal alignment; (2) introduces a multi-stage progressive training strategy that stabilizes convergence; and (3) incorporates lossless inference acceleration techniques, including token pruning and cache reuse. With only 2.6 billion parameters, the framework achieves state-of-the-art results across multiple benchmarks and metrics, including COCO captioning, FID, CLIP-Score, and prompt alignment, demonstrating substantial improvements in generation fidelity, prompt adherence, and both training and inference efficiency.
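The joint-sequence idea in the summary can be made concrete with a small sketch. The block below is not the official Unified Next-DiT implementation (module names, dimensions, and the plain self-attention layer are illustrative assumptions); it only shows how concatenating text and image tokens lets a single self-attention pass handle cross-modal interaction without a separate cross-attention path.

```python
# Minimal sketch of joint text-image token modeling. Not the official
# Unified Next-DiT code; names and sizes are illustrative.
import torch
import torch.nn as nn

class JointSequenceBlock(nn.Module):
    """One transformer block that attends over text and image tokens jointly."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, text_tokens: torch.Tensor, image_tokens: torch.Tensor):
        # Concatenate both modalities into one sequence so self-attention
        # covers text-text, image-image, and text-image interactions at once.
        x = torch.cat([text_tokens, image_tokens], dim=1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        # Split back so downstream layers can treat modalities separately.
        n_text = text_tokens.shape[1]
        return x[:, :n_text], x[:, n_text:]

block = JointSequenceBlock()
text = torch.randn(2, 77, 512)    # e.g., caption embeddings
image = torch.randn(2, 256, 512)  # e.g., patchified latent tokens
text_out, image_out = block(text, image)
```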

📝 Abstract
We introduce Lumina-Image 2.0, an advanced text-to-image generation framework that achieves significant progress over its predecessor, Lumina-Next. Lumina-Image 2.0 is built upon two key principles: (1) Unification: it adopts a unified architecture (Unified Next-DiT) that treats text and image tokens as a joint sequence, enabling natural cross-modal interactions and seamless task expansion. In addition, since high-quality captioners can provide semantically well-aligned text-image training pairs, we introduce a unified captioning system, Unified Captioner (UniCap), specifically designed for T2I generation tasks. UniCap excels at generating comprehensive and accurate captions, accelerating convergence and enhancing prompt adherence. (2) Efficiency: to improve the efficiency of our proposed model, we develop multi-stage progressive training strategies and introduce inference acceleration techniques without compromising image quality. Extensive evaluations on academic benchmarks and public text-to-image arenas show that Lumina-Image 2.0 delivers strong performance even with only 2.6B parameters, highlighting its scalability and design efficiency. We have released our training details, code, and models at https://github.com/Alpha-VLLM/Lumina-Image-2.0.
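The abstract promises inference acceleration without compromising quality. One common technique in this family, shown below as an assumption rather than the paper's confirmed method, is to encode the prompt once and reuse the cached text features across every denoising step, since the conditioning never changes during sampling. `encode_text`, `denoiser`, and the Euler update are hypothetical stand-ins, not Lumina-Image 2.0's actual API.

```python
# Hedged sketch: reuse fixed text-conditioning features across denoising
# steps. The denoiser interface and Euler integrator are illustrative.
import torch

@torch.no_grad()
def sample(denoiser, encode_text, prompt_ids, latent, steps: int = 30):
    # The prompt does not change during sampling, so its features can be
    # computed once and cached instead of re-encoded at every step.
    text_features = encode_text(prompt_ids)  # cached for all steps
    ts = torch.linspace(1.0, 0.0, steps + 1)
    for i in range(steps):
        t, t_next = ts[i], ts[i + 1]
        t_batch = torch.full((latent.shape[0],), float(t))
        velocity = denoiser(latent, t_batch, text_features)
        latent = latent + (t_next - t) * velocity  # simple Euler step
    return latent
```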
Problem

Research questions and friction points this paper is trying to address.

Unified architecture for text-image token interaction
Efficient training and inference without quality loss
Scalable model with high performance using minimal parameters
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified architecture for text-image token processing
Unified Captioner for accurate training pairs
Multi-stage training and inference acceleration (see the training-schedule sketch after this list)
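As referenced in the last item, here is a minimal sketch of what a multi-stage progressive training schedule can look like: train cheaply at low resolution first, then continue from the same weights at higher resolutions. The stage resolutions, step counts, learning rates, and the `train_one_stage` helper are placeholders, not the recipe reported in the paper.

```python
# Hedged sketch of multi-stage progressive training; the stage list and
# train_one_stage helper are hypothetical, not the paper's recipe.
from dataclasses import dataclass

@dataclass
class Stage:
    resolution: int   # training image resolution for this stage
    steps: int        # optimizer steps before moving on
    lr: float         # peak learning rate

# Low-resolution stages converge cheaply; later stages refine detail
# starting from the previous stage's weights.
SCHEDULE = [
    Stage(resolution=256, steps=100_000, lr=2e-4),
    Stage(resolution=512, steps=50_000, lr=1e-4),
    Stage(resolution=1024, steps=20_000, lr=5e-5),
]

def train(model, train_one_stage):
    for stage in SCHEDULE:
        # Each stage resumes from the current weights; only the data
        # resolution and optimizer settings change between stages.
        train_one_stage(model, resolution=stage.resolution,
                        steps=stage.steps, lr=stage.lr)
    return model
```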
👥 Authors

Qi Qin, The University of Sydney
Le Zhuo, Krea AI (generative models, multi-modal learning)
Yi Xin, California Institute of Technology (Industrial Organization, Econometrics)
Ruoyi Du, Shanghai AI Laboratory
Zhen Li, The Chinese University of Hong Kong
Bin Fu, Shanghai AI Laboratory
Yiting Lu, University of Science and Technology of China (VLM, Self-evolving Agent, Reasoning Model)
Jiakang Yuan, Fudan University (MLLMs, Multi-agent System, Reasoning)
Xinyue Li, Shanghai AI Laboratory
Dongyang Liu, MMLab CUHK (Image/Video Generation, LLMs, VLMs)
Xiangyang Zhu, Shanghai AI Laboratory
Manyuan Zhang, The Chinese University of Hong Kong
Will Beddow, Krea AI
Erwann Millon, Krea AI
Victor Perez, Krea AI
Wenhai Wang, Shanghai AI Laboratory
Conghui He, Shanghai AI Laboratory (Data-centric AI, LLM, Document Intelligence)
Bo Zhang, Shanghai AI Laboratory
Xiaohong Liu, Shanghai Jiao Tong University
Hongsheng Li, The Chinese University of Hong Kong
Yu Qiao, Shanghai AI Laboratory
Chang Xu, The University of Sydney
Peng Gao, Shanghai AI Laboratory