🤖 AI Summary
Existing visual tokenization methods struggle to achieve high-fidelity image reconstruction and zero-shot visual understanding simultaneously, and multimodal understanding and generation typically require separate modeling paradigms. To address these limitations, we propose QLIP, a unified visual tokenization framework built on a binary-spherical-quantization-based autoencoder. By jointly optimizing reconstruction and image-text alignment objectives, using a dynamically balanced dual-task loss and a two-stage pretraining pipeline (large-scale image-text contrastive learning followed by memory-efficient reconstruction fine-tuning), QLIP is the first method to support high-fidelity image reconstruction, zero-shot image classification, and text-conditioned image generation within a single model. As a drop-in component, QLIP matches or outperforms the original visual encoder in LLaVA and the original image tokenizer in LlamaGen, and enables an end-to-end autoregressive model for unified multimodal understanding and generation.
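The "dynamically balanced dual-task loss" can be illustrated with a generic scheme that weights each objective by the inverse of a running estimate of its magnitude, so that neither the alignment nor the reconstruction term dominates as their scales drift during training. This is a hedged sketch, not QLIP's exact balancing rule; the function name `balanced_total_loss`, the EMA formulation, and the `beta` parameter are illustrative assumptions.

```python
import numpy as np

def balanced_total_loss(loss_align, loss_recon, ema_align, ema_recon, beta=0.99):
    """Illustrative dynamic balancing (not the paper's exact rule):
    track an exponential moving average (EMA) of each loss and weight
    each term by the inverse of its EMA, so both contribute on a
    comparable scale regardless of their raw magnitudes."""
    ema_align = beta * ema_align + (1 - beta) * loss_align
    ema_recon = beta * ema_recon + (1 - beta) * loss_recon
    # Inverse-magnitude weighting: each normalized term hovers near 1.
    total = loss_align / ema_align + loss_recon / ema_recon
    return total, ema_align, ema_recon

# Example step: alignment loss is 4x the reconstruction loss in raw
# scale, but the normalized terms stay comparable.
total, ema_a, ema_r = balanced_total_loss(2.0, 0.5, ema_align=1.0, ema_recon=1.0)
```

In practice the weights would multiply gradients inside the training loop; the point of the sketch is only that the mixing ratio adapts to the observed loss scales instead of being a fixed hyperparameter.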
📝 Abstract
We introduce Quantized Language-Image Pretraining (QLIP), a visual tokenization method that combines state-of-the-art reconstruction quality with state-of-the-art zero-shot image understanding. QLIP trains a binary-spherical-quantization-based autoencoder with reconstruction and language-image alignment objectives. We are the first to show that the two objectives do not need to be at odds. We balance the two loss terms dynamically during training and show that a two-stage training pipeline effectively reconciles the large-batch requirements of language-image pretraining with the memory bottleneck imposed by the reconstruction objective. We validate the effectiveness of QLIP for multimodal understanding and text-conditioned image generation with a single model. Specifically, QLIP serves as a drop-in replacement for the visual encoder of LLaVA and the image tokenizer of LlamaGen, with comparable or even better performance. Finally, we demonstrate that QLIP enables a unified mixed-modality autoregressive model for understanding and generation.
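The binary spherical quantization bottleneck mentioned in the abstract can be sketched as follows: project each latent onto the unit hypersphere, then binarize every dimension to ±1/√d, so the quantized code also lies on the sphere and its sign pattern serves as a discrete token index. This is a minimal sketch under those assumptions; the helper name `bsq_quantize` is hypothetical, and the straight-through gradient used during training is not shown.

```python
import numpy as np

def bsq_quantize(z, eps=1e-8):
    """Sketch of binary spherical quantization: L2-normalize the latent,
    then snap each dimension to +/- 1/sqrt(d). The result is one of the
    2^d hypercube vertices rescaled to unit norm."""
    z = np.asarray(z, dtype=np.float64)
    d = z.shape[-1]
    u = z / (np.linalg.norm(z, axis=-1, keepdims=True) + eps)  # unit sphere
    q = np.sign(u)
    q[q == 0] = 1.0           # break exact-zero ties to +1
    return q / np.sqrt(d)     # rescale so ||q||_2 == 1

# A 4-d latent maps to one of 2^4 = 16 possible codes; the bit pattern
# (code > 0) is the discrete token id.
code = bsq_quantize(np.array([[0.3, -1.2, 0.5, 0.1]]))
```

Because the codebook is implicit (all sign patterns), there is no learned codebook to collapse, which is one reason binary spherical quantization scales to large vocabularies.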