Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer

📅 2025-10-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing visual tokenization methods rely on discrete latent spaces, where quantization errors limit semantic expressiveness and degrade cross-modal understanding. To address this, the authors propose MingTok, a unified visual tokenizer operating in a continuous latent space. MingTok employs a three-stage sequential architecture comprising low-level encoding, semantic expansion, and visual reconstruction; built on top of it, Ming-UniVision models both visual understanding and generation as autoregressive next-token prediction within a shared continuous representation space. This design avoids the quantization-induced distortions of discrete tokenizers and supports multi-round in-context tasks such as iterative understanding, generation, and editing. Extensive evaluations demonstrate state-of-the-art-level performance across diverse vision-language understanding and generation benchmarks. Inference code and model weights are publicly released.
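
To make the three-stage design concrete, here is a minimal PyTorch sketch of a continuous tokenizer in the spirit of MingTok. All module names, dimensions, and wiring are assumptions for illustration, not the authors' released implementation:

```python
# Minimal sketch of a three-stage continuous tokenizer in the spirit of
# MingTok. All module names, dimensions, and wiring are assumptions for
# illustration -- not the authors' released implementation.
import torch
import torch.nn as nn

class ContinuousTokenizerSketch(nn.Module):
    def __init__(self, patch_dim=768, low_dim=32, sem_dim=1024):
        super().__init__()
        # Stage 1: low-level encoding -- compress each patch into a compact
        # continuous code (no codebook, hence no quantization error).
        self.low_encoder = nn.Sequential(
            nn.Linear(patch_dim, 256), nn.GELU(), nn.Linear(256, low_dim))
        # Stage 2: semantic expansion -- lift the compact codes into
        # high-dimensional, discriminative features for understanding.
        self.semantic_expander = nn.Sequential(
            nn.Linear(low_dim, sem_dim), nn.GELU(), nn.Linear(sem_dim, sem_dim))
        # Stage 3: visual reconstruction -- decode semantic features back
        # into pixel-space patches for generation.
        self.decoder = nn.Sequential(
            nn.Linear(sem_dim, 256), nn.GELU(), nn.Linear(256, patch_dim))

    def forward(self, patches):
        low = self.low_encoder(patches)    # compact codes: generation-friendly
        sem = self.semantic_expander(low)  # rich features: understanding-friendly
        recon = self.decoder(sem)          # reconstructed patches
        return low, sem, recon

patches = torch.randn(2, 256, 768)         # (batch, num_patches, patch_dim)
low, sem, recon = ContinuousTokenizerSketch()(patches)
print(low.shape, sem.shape, recon.shape)   # [2, 256, 32] [2, 256, 1024] [2, 256, 768]
```

The point of the sequential design is that every stage maps between continuous tensors: the compact low-level codes serve generation, the expanded features serve understanding, and no codebook lookup appears anywhere in the path.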

📝 Abstract
Visual tokenization remains a core challenge in unifying visual understanding and generation within the autoregressive paradigm. Existing methods typically employ tokenizers in discrete latent spaces to align with the tokens from large language models, where the quantization errors can limit semantic expressiveness and degrade the capability of vision-language understanding. To address this, we introduce MingTok, a new family of visual tokenizers with a continuous latent space, for unified autoregressive generation and understanding. While understanding tasks favor discriminative high-dimensional features, generation tasks prefer compact low-level codes. Thus, to reconcile these competing demands, MingTok adopts a three-stage sequential architecture involving low-level encoding, semantic expansion, and visual reconstruction. Built on top of it, Ming-UniVision eliminates the need for task-specific visual representations, and unifies diverse vision-language tasks under a single autoregressive prediction paradigm. By formulating both understanding and generation as next-token prediction in a shared continuous space, it seamlessly supports multi-round, in-context tasks such as iterative understanding, generation and editing. Empirically, we find that using a unified continuous visual representation reconciles the competing requirements on the tokenizers by the understanding and generation tasks, thereby leading to state-of-the-art level performance across both domains. We hope our findings will facilitate unified visual tokenization in the continuous domain. Inference code and model weights are released to benefit the community.
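
Casting generation as next-token prediction in a continuous space means the model regresses the next latent vector instead of classifying over a discrete codebook. The sketch below illustrates that idea with a causal transformer and a simple MSE regression head; the paper's actual prediction head and training objective may differ, and all names and shapes here are illustrative:

```python
# Illustrative next-token prediction over continuous visual tokens: a causal
# transformer regresses the next latent vector instead of classifying over a
# discrete codebook. Names, shapes, and the MSE loss are assumptions; the
# paper's actual prediction head and objective may differ.
import torch
import torch.nn as nn

dim = 32                                    # continuous token dimension
backbone = nn.TransformerEncoder(           # stand-in for the causal LM
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2)
head = nn.Linear(dim, dim)                  # regression head over latent vectors

tokens = torch.randn(2, 256, dim)           # (batch, seq_len, dim) from the tokenizer
seq_len = tokens.size(1)
causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

hidden = backbone(tokens[:, :-1], mask=causal[:-1, :-1])  # attend only to the past
pred = head(hidden)                                       # predict token t+1 from tokens <= t
loss = nn.functional.mse_loss(pred, tokens[:, 1:])        # teacher-forced regression
loss.backward()
```
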
Problem

Research questions and friction points this paper is trying to address.

Addresses visual tokenization limitations in unifying understanding and generation
Reconciles competing demands of discriminative features and compact codes
Unifies diverse vision-language tasks under single autoregressive prediction paradigm
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continuous latent space tokenizer for vision tasks
Three-stage architecture for encoding and reconstruction
Unified autoregressive prediction for understanding and generation
👥 Authors
Ziyuan Huang (Inclusion AI, Ant Group)
DanDan Zheng (Inclusion AI, Ant Group)
Cheng Zou (Inclusion AI, Ant Group)
Rui Liu (Inclusion AI, Ant Group)
Xiaolong Wang (Inclusion AI, Ant Group)
Kaixiang Ji (Ant Group)
Weilong Chai (Inclusion AI, Ant Group)
Jianxin Sun (Inclusion AI, Ant Group)
Libin Wang (Inclusion AI, Ant Group)
Yongjie Lv (Inclusion AI, Ant Group)
Taozhi Huang (Inclusion AI, Ant Group)
Jiajia Liu (Ant Group)
Qingpei Guo (Ant Group)
Ming Yang (Inclusion AI, Ant Group)
Jingdong Chen (Inclusion AI, Ant Group)
Jun Zhou (Inclusion AI, Ant Group)