🤖 AI Summary
This work addresses key challenges in text-to-image generation—namely inconsistent cross-modal modeling, poor task scalability, low inference efficiency, and limited generation quality—by proposing a synergistic framework comprising the Unified Next-DiT architecture and the UniCap high-fidelity image captioning system. Methodologically: (1) it jointly models text and image token sequences for end-to-end cross-modal alignment; (2) it introduces a multi-stage progressive training strategy to enhance convergence stability; and (3) it incorporates inference acceleration techniques, such as token pruning and cache reuse, that do not compromise image quality. With only 2.6 billion parameters, the framework delivers strong performance across multiple benchmarks and metrics—including COCO captioning, FID, CLIP-Score, and prompt alignment—demonstrating substantial improvements in generation fidelity, prompt adherence, and both training and inference efficiency.
📝 Abstract
We introduce Lumina-Image 2.0, an advanced text-to-image generation framework that achieves significant progress over its predecessor, Lumina-Next. Lumina-Image 2.0 is built upon two key principles: (1) Unification - it adopts a unified architecture (Unified Next-DiT) that treats text and image tokens as a joint sequence, enabling natural cross-modal interactions and allowing seamless task expansion. In addition, since high-quality captioners can provide semantically well-aligned text-image training pairs, we introduce a unified captioning system, Unified Captioner (UniCap), specifically designed for T2I generation tasks. UniCap excels at generating comprehensive and accurate captions, accelerating convergence and enhancing prompt adherence. (2) Efficiency - to improve the efficiency of our proposed model, we develop multi-stage progressive training strategies and introduce inference acceleration techniques without compromising image quality. Extensive evaluations on academic benchmarks and public text-to-image arenas show that Lumina-Image 2.0 delivers strong performance even with only 2.6B parameters, highlighting its scalability and design efficiency. We have released our training details, code, and models at https://github.com/Alpha-VLLM/Lumina-Image-2.0.
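The "unified sequence" principle described above can be illustrated with a minimal toy sketch: text tokens and image patch tokens are concatenated into a single sequence, so one self-attention pass models cross-modal interactions directly. This is an assumption-based illustration of the idea, not the released Unified Next-DiT code; all shapes, values, and function names here are invented for the example.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def joint_self_attention(tokens, d):
    # Toy single-head self-attention: every position attends over the
    # full joint sequence, so text positions and image positions mix
    # in one pass (this is what "joint sequence" buys architecturally).
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        w = softmax(scores)
        out.append([sum(wj * v[i] for wj, v in zip(w, tokens))
                    for i in range(d)])
    return out

d = 4
# Toy embeddings: 3 "text" tokens and 5 "image patch" tokens.
text_tokens = [[0.1 * (i + j) for j in range(d)] for i in range(3)]
image_tokens = [[0.2 * (i - j) for j in range(d)] for i in range(5)]

# Unification: one concatenated sequence, no separate cross-attention branch.
joint = text_tokens + image_tokens
mixed = joint_self_attention(joint, d)

assert len(mixed) == len(joint)      # every token updated jointly
assert all(len(t) == d for t in mixed)
```

In a real DiT-style model the same concatenation trick means new modalities or tasks can be added by extending the token sequence, which is one reading of the "seamless task expansion" claim in the abstract.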