🤖 AI Summary
Existing multimodal large language models (MLLMs) exhibit strong multimodal understanding capabilities but remain limited in joint image-text generation, particularly in scenarios lacking detailed textual descriptions of images. This paper introduces generative vokens: discrete, learnable units that bridge visual and linguistic modalities within a single generative token space. The authors propose a two-stage, description-free training paradigm that enables interleaved image-text generation without requiring extensive image descriptions, and integrate classifier-free guidance to improve the alignment and consistency of generated images and texts. The resulting model, MiniGPT-5, achieves substantial improvements over baseline models on the MMDialog and VIST benchmarks; in human evaluation, its multimodal outputs are preferred over the baseline in more than 56% of cases. Key contributions include: (1) generative vokens as a unified representation for coherent image-text outputs; and (2) a two-stage training framework for description-free multimodal generation.
📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated profound capabilities in multimodal understanding. However, the simultaneous generation of images with coherent texts remains underdeveloped. To address this, we introduce a novel interleaved vision-and-language generation method centered on the concept of "generative vokens", which serve as pivotal elements for producing coherent image-text outputs. Our method employs a unique two-stage training strategy for description-free multimodal generation, which does not require extensive descriptions of images. We integrate classifier-free guidance to enhance the alignment of generated images and texts, ensuring more seamless and contextually relevant multimodal interactions. Our model, MiniGPT-5, exhibits substantial improvement over baseline models on multimodal generation datasets, including MMDialog and VIST. Human evaluation shows that MiniGPT-5 outperforms the baseline model in more than 56% of cases for multimodal generation, highlighting its efficacy across diverse benchmarks.