AI Summary
This work addresses the challenge of jointly optimizing visual generation and understanding tasks. We propose UniFluid, a unified autoregressive framework that, for the first time, co-models discrete text tokens and continuous image tokens within a single architecture, supporting multimodal (image + text) inputs and enabling end-to-end execution of both generation (e.g., image synthesis, editing) and understanding (e.g., captioning, visual question answering) tasks. Key ingredients include: (i) continuous visual token representations; (ii) a tuned loss-balancing weight between the two objectives; and (iii) random ordering of image-token generation during training, which together mitigate the rigid trade-off between tasks and let them improve each other. Built upon the Gemma family of LLMs, UniFluid combines strong pre-trained language backbones with multimodal autoregressive modeling. Experiments demonstrate performance comparable to or exceeding single-task baselines on both generation and understanding, alongside strong cross-task transferability.
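The loss-balancing idea in point (ii) can be sketched as a simple weighted sum of the two task losses. This is a minimal illustration under stated assumptions: the function name, the exact loss forms, and the single scalar weight are illustrative, not the paper's implementation (which trains on discrete text tokens and continuous image tokens with their own respective losses).

```python
# Hedged sketch: combine an understanding (text) loss and a generation
# (image) loss with a single balance weight. The names `text_ce`,
# `image_loss`, and `balance` are assumptions for illustration only.

def joint_loss(text_ce: float, image_loss: float, balance: float = 0.5) -> float:
    """Weighted sum of the two task losses.

    balance = 1.0 trains generation only; balance = 0.0 trains
    understanding only; values in between trade the two off.
    """
    if not 0.0 <= balance <= 1.0:
        raise ValueError("balance must lie in [0, 1]")
    return (1.0 - balance) * text_ce + balance * image_loss
```

Sweeping `balance` is how one would search for the operating point the summary describes, where neither task degrades the other.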
Abstract
We present UniFluid, a unified autoregressive framework for joint visual generation and understanding that leverages continuous visual tokens. Our unified autoregressive architecture processes multimodal image and text inputs, generating discrete tokens for text and continuous tokens for images. We find that, although there is an inherent trade-off between the image generation and understanding tasks, a carefully tuned training recipe enables them to improve each other. By selecting an appropriate loss-balance weight, the unified model achieves results comparable to or exceeding those of single-task baselines on both tasks. Furthermore, we demonstrate that employing stronger pre-trained LLMs and random-order generation during training is important for achieving high-fidelity image generation within this unified framework. Built upon the Gemma model series, UniFluid exhibits competitive performance across both image generation and understanding, and demonstrates strong transferability to various downstream tasks, including image editing for generation, as well as visual captioning and question answering for understanding.
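The random-order generation mentioned above can be illustrated with a short sketch: instead of always predicting image tokens left-to-right, each sample draws a fresh permutation of token positions and the model is trained to predict tokens in that order. The function below is an assumption-labeled illustration; the paper's exact ordering scheme is not specified in this text.

```python
import random

# Hedged sketch: produce a random generation order over image-token
# positions for one training sample. `random_generation_order` and its
# parameters are illustrative names, not the paper's API.

def random_generation_order(num_tokens, seed=None):
    """Return a random permutation of token positions [0, num_tokens)."""
    rng = random.Random(seed)  # per-sample RNG so orders differ across samples
    order = list(range(num_tokens))
    rng.shuffle(order)
    return order
```

At each step the model would condition on the tokens already placed at earlier positions in this order and predict the token at the next position.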