AI Summary
This work addresses the challenges of cross-modal interference and representation imbalance in unified multimodal modeling. We propose UGen, the first model that discretizes both text and images into token sequences and processes them jointly with a single autoregressive Transformer for unified vision-language understanding and generation. Its core innovation is a progressive vocabulary learning mechanism that dynamically activates and fuses visual token IDs, effectively mitigating modality interference. Images are discretized with a VQ-VAE, and vocabulary-expansion training is employed to accommodate the visual tokens. Evaluated on a comprehensive multimodal benchmark spanning diverse vision-language tasks, UGen achieves an overall performance gain of 13.3% over existing unified models, matching or exceeding task-specific state-of-the-art models.
Abstract
We introduce UGen, a unified autoregressive multimodal model that delivers strong performance on text processing, image understanding, and image generation simultaneously. UGen converts both text and images into discrete token sequences and uses a single transformer to generate them uniformly in an autoregressive manner. To address the challenges of unified multimodal learning, UGen is trained with a novel mechanism, progressive vocabulary learning, in which visual token IDs are incrementally activated and integrated into training, ultimately improving the effectiveness of unified multimodal learning. Experiments on comprehensive text and image tasks show that UGen achieves a significant overall improvement of 13.3% over the vanilla unified autoregressive method, and it delivers competitive results on all tasks against several task-specific models.
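To make the progressive vocabulary learning idea concrete, the sketch below shows one plausible reading of it: visual token IDs sit after the text vocabulary, a schedule gradually grows the number of activated visual IDs over training, and logits for not-yet-activated IDs are masked out so the model never samples them. The linear schedule, the function names, and the logit-masking strategy are all illustrative assumptions, not details from the paper.

```python
def active_visual_tokens(step: int, total_steps: int, n_visual: int) -> int:
    """Hypothetical linear schedule: how many visual token IDs are
    activated at a given training step (assumed, not from the paper)."""
    frac = min(1.0, step / max(1, total_steps))
    return int(frac * n_visual)


def mask_inactive_logits(logits: list[float], text_vocab_size: int,
                         n_active_visual: int) -> list[float]:
    """Suppress logits of visual token IDs that are not yet activated.

    Assumes the joint vocabulary is laid out as
    [text IDs 0..text_vocab_size-1][visual IDs text_vocab_size..].
    """
    cutoff = text_vocab_size + n_active_visual
    return [x if i < cutoff else float("-inf")
            for i, x in enumerate(logits)]


# Early in training no visual IDs are active; by the end all are.
assert active_visual_tokens(0, 1000, 8192) == 0
assert active_visual_tokens(500, 1000, 8192) == 4096
assert active_visual_tokens(1000, 1000, 8192) == 8192
```

Under this reading, the unified transformer is trained on the full mixed token stream throughout, while the masking simply widens the reachable visual sub-vocabulary over time, which is one way to limit early interference between modalities.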