UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation

πŸ“… 2025-10-12
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing unified visual tokenizers struggle to balance high-level semantic understanding against low-level pixel reconstruction, leading to performance trade-offs across tasks. This paper introduces UniFlow, the first unified pixel-flow tokenizer enabling joint optimization of understanding and generation. Its core innovations are: (1) a layer-wise adaptive self-distillation mechanism that mitigates multi-task training conflicts; and (2) a lightweight patch-wise pixel-flow decoder that achieves high-fidelity, semantics-conditioned detail reconstruction. Built upon a pre-trained vision encoder, UniFlow achieves state-of-the-art results across 13 benchmarks: its 7B variant outperforms the 14B TokenFlow-XL by 7.75% on average understanding accuracy, while surpassing UniTok by 0.15 in rFID and 0.09 in gFID (without guidance). To our knowledge, UniFlow is the first unified architecture to simultaneously advance both understanding and generation capabilities without compromise.
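
As a rough illustration of mechanism (1), the sketch below shows one way layer-wise adaptive self-distillation could be wired up: a frozen copy of the pre-trained encoder serves as the teacher, and each student layer is pulled toward its teacher counterpart with a learnable per-layer weight. The class name, the softmax weighting, and the cosine matching loss are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerwiseSelfDistillation(nn.Module):
    """Minimal sketch of layer-wise adaptive self-distillation.

    A frozen copy of the pre-trained encoder acts as the teacher; the
    trainable student is regularized toward the teacher's intermediate
    features, with one learnable weight per layer so the model can
    trade semantic preservation against reconstruction fidelity.
    """

    def __init__(self, num_layers: int):
        super().__init__()
        # One learnable logit per layer; softmax turns them into
        # adaptive, non-negative distillation weights.
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, student_feats, teacher_feats):
        # student_feats / teacher_feats: lists of (B, N, D) tensors,
        # one per encoder layer. Teacher features are detached (frozen).
        weights = torch.softmax(self.layer_logits, dim=0)
        loss = 0.0
        for w, s, t in zip(weights, student_feats, teacher_feats):
            loss = loss + w * (1.0 - F.cosine_similarity(s, t.detach(), dim=-1)).mean()
        return loss
```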

πŸ“ Abstract
The tokenizer is a crucial component for both visual understanding and generation. To advance toward the ultimate goal of universal modeling, recent research has focused on developing a unified tokenizer. However, existing tokenizers face a significant performance trade-off between understanding and generation, stemming from the inherent conflict between high-level semantic abstraction and low-level pixel reconstruction. To tackle this challenge, we propose a generic and unified tokenizer, namely UniFlow, by flexibly adapting any visual encoder with a concise reconstruction decoder. Specifically, we introduce layer-wise adaptive self-distillation applied to well-pretrained visual encoders, which enables UniFlow to simultaneously inherit strong semantic features for visual understanding and flexibly adapt to model fine-grained details for visual generation. Moreover, we propose a lightweight patch-wise pixel flow decoder, which efficiently achieves high-fidelity pixel reconstruction by modeling a conditional flow from the noisy state back to the patch-wise pixel domain. By leveraging the semantic features as visual conditions for the decoder, we effectively alleviate the training conflict between understanding and generation. Furthermore, the patch-wise learning strategy simplifies the data distribution, thereby improving training efficiency. Extensive experiments across 13 challenging benchmarks spanning 7 widely studied visual understanding and generation tasks demonstrate that UniFlow achieves a win-win outcome. For instance, our 7B UniFlow-XL not only surpasses the 14B TokenFlow-XL by 7.75% on average across understanding benchmarks, but also achieves competitive results in both visual reconstruction and generation, surpassing UniTok by 0.15 in rFID and 0.09 in gFID (without guidance), respectively.
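
As a concrete (hypothetical) rendering of the patch-wise pixel flow decoder, the sketch below follows a rectified-flow recipe: each noisy patch is interpolated between Gaussian noise and its clean pixels, and a small conditional network learns to regress the constant velocity along that straight path, with the encoder's semantic token as the condition. The MLP architecture, layer sizes, and function names are assumptions; the paper's decoder will differ in detail.

```python
import torch
import torch.nn as nn

class PatchFlowDecoder(nn.Module):
    """Sketch of a patch-wise conditional flow decoder.

    Each patch is denoised independently, conditioned on its semantic
    token, which keeps the per-patch data distribution simple and the
    decoder lightweight.
    """

    def __init__(self, patch_dim: int, cond_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(patch_dim + cond_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, patch_dim),
        )

    def forward(self, x_t, t, cond):
        # x_t:  (B, N, patch_dim) noisy patches at flow time t
        # t:    (B, N, 1)         flow time in [0, 1]
        # cond: (B, N, cond_dim)  semantic tokens from the encoder
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def flow_matching_loss(decoder, x1, cond):
    """Regress the constant velocity x1 - x0 along the straight path."""
    x0 = torch.randn_like(x1)                                    # noise endpoint
    t = torch.rand(x1.shape[0], x1.shape[1], 1, device=x1.device)
    x_t = (1.0 - t) * x0 + t * x1                                # linear interpolation
    v_pred = decoder(x_t, t, cond)
    return ((v_pred - (x1 - x0)) ** 2).mean()
```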
Problem

Research questions and friction points this paper is trying to address.

Unifying visual understanding and generation tokenizers to resolve performance trade-offs
Enabling semantic feature inheritance while adapting to fine-grained pixel reconstruction
Achieving high-fidelity pixel flow decoding with simplified patch-wise learning strategy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Layer-wise adaptive self-distillation for semantic inheritance
Lightweight patch-wise pixel flow decoder for reconstruction (see the sampling sketch after this list)
Unified tokenizer balancing understanding and generation tasks
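
At inference time, reconstruction amounts to integrating the learned flow from noise back to pixels. A plain Euler loop over the hypothetical PatchFlowDecoder sketched above might look like this; the solver and step count used in the paper are not stated here and are assumed.

```python
import torch

@torch.no_grad()
def decode_patches(decoder, cond, patch_dim, steps=20):
    """Euler-integrate the learned flow from t=0 (noise) to t=1 (pixels)."""
    B, N, _ = cond.shape
    x = torch.randn(B, N, patch_dim, device=cond.device)   # start from noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((B, N, 1), i * dt, device=cond.device)
        x = x + decoder(x, t, cond) * dt                    # follow predicted velocity
    return x  # patch-wise pixels; unpatchify to assemble the image
```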
πŸ‘₯ Authors
Zhengrong Yue
Shanghai Jiao Tong University, PhD
Unified Multimodal Modeling · Video Understanding · Video Generation
Haiyu Zhang
Beihang University
Neural Fields
Xiangyu Zeng
Nanjing University
Boyu Chen
The University of Sydney
Neural Architecture Search · Transformer
Chenting Wang
Shanghai Jiao Tong University
Computer Vision · Video Understanding
Shaobin Zhuang
Shanghai Jiao Tong University
Video Generation · Computer Vision
Lu Dong
University of Science and Technology of China
KunPeng Du
Shanghai Jiao Tong University
Yi Wang
Shanghai AI Laboratory
Limin Wang
Nanjing University
Yali Wang
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences