AI Summary
This work addresses the challenges of cross-modal interference and representation imbalance in unified multimodal modeling. We propose UGen, the first model that discretizes both text and images into token sequences and processes them jointly with a single autoregressive Transformer for unified vision-language understanding and generation. Its core innovation is a progressive vocabulary learning mechanism that dynamically activates and fuses visual token IDs, effectively mitigating modality interference. Images are discretized with a VQ-VAE, and vocabulary-expansion training is employed to accommodate the visual tokens. Evaluated on a comprehensive multimodal benchmark spanning diverse vision-language tasks, UGen achieves an overall performance gain of 13.3% over existing unified models, matching or exceeding task-specific state-of-the-art models.
Abstract
We introduce UGen, a unified autoregressive multimodal model that delivers strong performance on text processing, image understanding, and image generation simultaneously. UGen converts both text and images into discrete token sequences and uses a single transformer to generate them uniformly in an autoregressive manner. To address the challenges of unified multimodal learning, UGen is trained with a novel mechanism, progressive vocabulary learning, in which visual token IDs are incrementally activated and integrated into training, ultimately improving the effectiveness of unified multimodal learning. Experiments on comprehensive text and image tasks show that UGen achieves a significant overall improvement of 13.3% over the vanilla unified autoregressive method, and it delivers competitive results on all tasks against several task-specific models.
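To make the progressive vocabulary learning idea concrete, the sketch below shows one plausible reading of it: visual token IDs sit after the text vocabulary, a schedule gradually grows the number of activated visual IDs over training, and logits for not-yet-activated IDs are masked out so the model never samples them. The linear schedule, the function names, and the logit-masking strategy are all illustrative assumptions, not details from the paper.

```python
def active_visual_tokens(step: int, total_steps: int, n_visual: int) -> int:
    """Hypothetical linear schedule: how many visual token IDs are
    activated at a given training step (assumed, not from the paper)."""
    frac = min(1.0, step / max(1, total_steps))
    return int(frac * n_visual)


def mask_inactive_logits(logits: list[float], text_vocab_size: int,
                         n_active_visual: int) -> list[float]:
    """Suppress logits of visual token IDs that are not yet activated.

    Assumes the joint vocabulary is laid out as
    [text IDs 0..text_vocab_size-1][visual IDs text_vocab_size..].
    """
    cutoff = text_vocab_size + n_active_visual
    return [x if i < cutoff else float("-inf")
            for i, x in enumerate(logits)]


# Early in training no visual IDs are active; by the end all are.
assert active_visual_tokens(0, 1000, 8192) == 0
assert active_visual_tokens(500, 1000, 8192) == 4096
assert active_visual_tokens(1000, 1000, 8192) == 8192
```

Under this reading, the unified transformer is trained on the full mixed token stream throughout, while the masking simply widens the reachable visual sub-vocabulary over time, which is one way to limit early interference between modalities.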