AI Summary
This work addresses the challenge of jointly optimizing visual generation and understanding tasks. We propose UniFluid, a unified autoregressive framework that, for the first time, co-models discrete text tokens and continuous image tokens within a single architecture, supporting multimodal (image + text) inputs and enabling end-to-end execution of both generation (e.g., image synthesis, editing) and understanding (e.g., captioning, visual question answering) tasks. Key ingredients include: (i) continuous visual token representations; (ii) a tuned loss-balancing weight between the two objectives; and (iii) random ordering of image-token generation during training, which together mitigate the rigid trade-off between tasks and let them improve each other. Built upon the Gemma family of LLMs, UniFluid combines strong pre-trained language backbones with multimodal autoregressive modeling. Experiments demonstrate performance comparable to or exceeding single-task baselines on both generation and understanding, alongside strong cross-task transferability.
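The loss-balancing idea in point (ii) can be sketched as a simple weighted sum of the two task losses. This is a minimal illustration under stated assumptions: the function name, the exact loss forms, and the single scalar weight are illustrative, not the paper's implementation (which trains on discrete text tokens and continuous image tokens with their own respective losses).

```python
# Hedged sketch: combine an understanding (text) loss and a generation
# (image) loss with a single balance weight. The names `text_ce`,
# `image_loss`, and `balance` are assumptions for illustration only.

def joint_loss(text_ce: float, image_loss: float, balance: float = 0.5) -> float:
    """Weighted sum of the two task losses.

    balance = 1.0 trains generation only; balance = 0.0 trains
    understanding only; values in between trade the two off.
    """
    if not 0.0 <= balance <= 1.0:
        raise ValueError("balance must lie in [0, 1]")
    return (1.0 - balance) * text_ce + balance * image_loss
```

Sweeping `balance` is how one would search for the operating point the summary describes, where neither task degrades the other.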
Abstract
We present UniFluid, a unified autoregressive framework for joint visual generation and understanding that leverages continuous visual tokens. Our unified autoregressive architecture processes multimodal image and text inputs, generating discrete tokens for text and continuous tokens for images. We find that, although there is an inherent trade-off between the image generation and understanding tasks, a carefully tuned training recipe enables them to improve each other. By selecting an appropriate loss-balance weight, the unified model achieves results comparable to or exceeding those of single-task baselines on both tasks. Furthermore, we demonstrate that employing stronger pre-trained LLMs and random-order generation during training is important for achieving high-fidelity image generation within this unified framework. Built upon the Gemma model series, UniFluid exhibits competitive performance across both image generation and understanding, and demonstrates strong transferability to various downstream tasks, including image editing for generation, as well as visual captioning and question answering for understanding.
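The random-order generation mentioned above can be illustrated with a short sketch: instead of always predicting image tokens left-to-right, each sample draws a fresh permutation of token positions and the model is trained to predict tokens in that order. The function below is an assumption-labeled illustration; the paper's exact ordering scheme is not specified in this text.

```python
import random

# Hedged sketch: produce a random generation order over image-token
# positions for one training sample. `random_generation_order` and its
# parameters are illustrative names, not the paper's API.

def random_generation_order(num_tokens, seed=None):
    """Return a random permutation of token positions [0, num_tokens)."""
    rng = random.Random(seed)  # per-sample RNG so orders differ across samples
    order = list(range(num_tokens))
    rng.shuffle(order)
    return order
```

At each step the model would condition on the tokens already placed at earlier positions in this order and predict the token at the next position.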