🤖 AI Summary
This work addresses key challenges in text-to-image generation—namely inconsistent cross-modal modeling, poor task scalability, low inference efficiency, and limited generation quality—by proposing a synergistic framework comprising the Unified Next-DiT architecture and the UniCap high-fidelity image captioning system. Methodologically: (1) it jointly models text and image token sequences for end-to-end cross-modal alignment; (2) it introduces a multi-stage progressive training strategy to enhance convergence stability; and (3) it incorporates inference acceleration techniques, such as token pruning and cache reuse, that do not compromise image quality. With only 2.6 billion parameters, the framework delivers strong performance across multiple benchmarks and metrics—including COCO captioning, FID, CLIP-Score, and prompt alignment—demonstrating substantial improvements in generation fidelity, prompt adherence, and both training and inference efficiency.
📝 Abstract
We introduce Lumina-Image 2.0, an advanced text-to-image generation framework that achieves significant progress over its predecessor, Lumina-Next. Lumina-Image 2.0 is built upon two key principles: (1) Unification - it adopts a unified architecture (Unified Next-DiT) that treats text and image tokens as a joint sequence, enabling natural cross-modal interactions and allowing seamless task expansion. In addition, since high-quality captioners can provide semantically well-aligned text-image training pairs, we introduce a unified captioning system, Unified Captioner (UniCap), specifically designed for T2I generation tasks. UniCap excels at generating comprehensive and accurate captions, accelerating convergence and enhancing prompt adherence. (2) Efficiency - to improve the efficiency of our proposed model, we develop multi-stage progressive training strategies and introduce inference acceleration techniques without compromising image quality. Extensive evaluations on academic benchmarks and public text-to-image arenas show that Lumina-Image 2.0 delivers strong performance even with only 2.6B parameters, highlighting its scalability and design efficiency. We have released our training details, code, and models at https://github.com/Alpha-VLLM/Lumina-Image-2.0.
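The "unified sequence" principle described above can be illustrated with a minimal toy sketch: text tokens and image patch tokens are concatenated into a single sequence, so one self-attention pass models cross-modal interactions directly. This is an assumption-based illustration of the idea, not the released Unified Next-DiT code; all shapes, values, and function names here are invented for the example.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def joint_self_attention(tokens, d):
    # Toy single-head self-attention: every position attends over the
    # full joint sequence, so text positions and image positions mix
    # in one pass (this is what "joint sequence" buys architecturally).
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        w = softmax(scores)
        out.append([sum(wj * v[i] for wj, v in zip(w, tokens))
                    for i in range(d)])
    return out

d = 4
# Toy embeddings: 3 "text" tokens and 5 "image patch" tokens.
text_tokens = [[0.1 * (i + j) for j in range(d)] for i in range(3)]
image_tokens = [[0.2 * (i - j) for j in range(d)] for i in range(5)]

# Unification: one concatenated sequence, no separate cross-attention branch.
joint = text_tokens + image_tokens
mixed = joint_self_attention(joint, d)

assert len(mixed) == len(joint)      # every token updated jointly
assert all(len(t) == d for t in mixed)
```

In a real DiT-style model the same concatenation trick means new modalities or tasks can be added by extending the token sequence, which is one reading of the "seamless task expansion" claim in the abstract.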