Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining

📅 2024-08-05
🏛️ arXiv.org
📈 Citations: 46
Influential: 8
🤖 AI Summary
To address the limited flexibility of text-to-image generation and the weak generalization of multimodal models, this paper introduces the Lumina-mGPT series: decoder-only autoregressive models built on two techniques, Unambiguous Image Representation (UniRep) and Flexible Progressive Supervised Fine-Tuning (FP-SFT), which for the first time bring an autoregressive framework to image generation quality competitive with diffusion models. The paper further proposes Omnipotent Supervised Fine-Tuning (Omni-SFT), a unified multi-task supervised fine-tuning paradigm that jointly supports more than ten vision, language, and vision-language tasks, including aspect-ratio-flexible and multi-view text-to-image generation, controllable synthesis, semantic segmentation, depth estimation, and multi-turn visual question answering. Lumina-mGPT shows strong generation fidelity, compositional flexibility, and cross-task generalization. The code and models are publicly released.
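
As a concrete illustration of the decoder-only autoregressive generation described above, here is a minimal, self-contained sketch: discrete image tokens are sampled one at a time conditioned on the prompt, and the resulting token grid would then be handed to a VQ decoder for pixels. All names here (`toy_logits`, `VOCAB`, the codebook size) are hypothetical stand-ins, not the Lumina-mGPT API.

```python
# Hedged sketch of decoder-only autoregressive image generation:
# image tokens are emitted one by one, conditioned on the text prompt.
import numpy as np

VOCAB = 8192          # assumed size of the discrete image-token codebook
rng = np.random.default_rng(0)

def toy_logits(sequence):
    """Stand-in for the transformer forward pass: random next-token logits."""
    return rng.normal(size=VOCAB)

def sample_image_tokens(prompt_tokens, num_image_tokens, temperature=1.0):
    """Autoregressively sample image tokens conditioned on the prompt."""
    seq = list(prompt_tokens)
    image_tokens = []
    for _ in range(num_image_tokens):
        logits = toy_logits(seq) / temperature
        probs = np.exp(logits - logits.max())   # numerically stable softmax
        probs /= probs.sum()
        tok = int(rng.choice(VOCAB, p=probs))
        seq.append(tok)                         # each token conditions the next
        image_tokens.append(tok)
    return image_tokens

# A 32x32 token grid; a VQ decoder (not shown) would map it back to pixels.
tokens = sample_image_tokens(prompt_tokens=[1, 2, 3], num_image_tokens=32 * 32)
print(len(tokens))  # 1024
```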

📝 Abstract
We present Lumina-mGPT, a family of multimodal autoregressive models capable of various vision and language tasks, particularly excelling in generating flexible photorealistic images from text descriptions. By initializing from multimodal Generative PreTraining (mGPT), we demonstrate that a decoder-only Autoregressive (AR) model can achieve image generation performance comparable to modern diffusion models with high efficiency through Flexible Progressive Supervised Fine-tuning (FP-SFT). Equipped with our proposed Unambiguous image Representation (UniRep), Lumina-mGPT can flexibly generate high-quality images of varying aspect ratios. Building on these strong image generation capabilities, we further explore Omnipotent Supervised Fine-tuning (Omni-SFT), an initial attempt to elevate Lumina-mGPT into a unified multimodal generalist. The resulting model demonstrates versatile multimodal capabilities, including visual generation tasks like text-to-image/multiview generation and controllable generation, visual recognition tasks like segmentation and depth estimation, and vision-language tasks like multi-turn visual question answering, showing the promising potential of this technical direction. Code and checkpoints are available at https://github.com/Alpha-VLLM/Lumina-mGPT.
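
The abstract credits UniRep with enabling flexible aspect ratios by letting a flattened 1-D token sequence unambiguously encode the 2-D grid shape. The sketch below illustrates one way such an encoding can work, using explicit shape-indicator and end-of-line tokens; the concrete token names (`<h...>`, `<w...>`, `<eol>`) are illustrative assumptions, not Lumina-mGPT's actual vocabulary.

```python
# Hedged sketch of an unambiguous flattened image representation:
# the 2-D grid shape stays recoverable from the 1-D sequence.
def encode_unirep(token_grid):
    """Flatten a 2-D grid of image tokens, keeping the shape recoverable."""
    h, w = len(token_grid), len(token_grid[0])
    seq = [f"<h{h}>", f"<w{w}>"]          # explicit shape indicators
    for row in token_grid:
        seq.extend(row)
        seq.append("<eol>")               # end-of-line marker per row
    return seq

def decode_unirep(seq):
    """Recover the 2-D grid from the flattened sequence."""
    h = int(seq[0][2:-1])
    w = int(seq[1][2:-1])
    body = [t for t in seq[2:] if t != "<eol>"]
    return [body[r * w:(r + 1) * w] for r in range(h)]

grid = [[0, 1, 2], [3, 4, 5]]             # a 2x3 toy "image"
assert decode_unirep(encode_unirep(grid)) == grid
```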
Problem

Research questions and friction points this paper is trying to address.

Enabling flexible photorealistic text-to-image generation
Achieving image generation comparable to diffusion models
Unifying multimodal tasks with a single generalist model
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal autoregressive models for text-to-image
Flexible Progressive Supervised Fine-tuning (FP-SFT), sketched after this list
Unambiguous image Representation (UniRep) for varied aspect ratios
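
As referenced in the FP-SFT item above, here is a hedged sketch of the flexible-progressive idea: fine-tuning proceeds in stages from low to high resolution, so coarse text-image alignment is learned cheaply before high-resolution detail is refined. The stage resolutions and step counts below are illustrative assumptions, not the paper's actual schedule.

```python
# Hedged sketch of a progressive (low -> high resolution) fine-tuning loop.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Stage:
    resolution: int   # target image side length in pixels (assumed)
    steps: int        # training steps at this stage (assumed)

# Illustrative schedule: coarse alignment first, detail refinement last.
SCHEDULE = [Stage(512, 10_000), Stage(768, 5_000), Stage(1024, 2_500)]

def run_fp_sft(schedule, train_step):
    """Run each stage in order, passing its resolution to every step."""
    for stage in schedule:
        for step in range(stage.steps):
            train_step(resolution=stage.resolution, step=step)

# Toy train_step that just counts calls per resolution.
calls = Counter()
run_fp_sft(SCHEDULE, lambda resolution, step: calls.update([resolution]))
print(calls)  # Counter({512: 10000, 768: 5000, 1024: 2500})
```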