Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning

📅 2025-05-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a long-standing limitation of reinforcement learning (RL) in visual generation: its dependence on continuous or non-causal visual representations. We propose Selftok, the first discrete visual tokenizer that discards the conventional spatial prior. Its core innovation is to unify the diffusion inversion process with an autoregressive (AR) prior, thereby establishing theoretically that AR visual tokens satisfy the Bellman equation and naturally support policy-gradient RL. This yields a principled unification of the diffusion and AR paradigms, enabling pure AR multimodal modeling without auxiliary modules. Experiments show that Selftok achieves a state-of-the-art trade-off between reconstruction fidelity and compression rate. Crucially, without any text–image training pairs, it significantly outperforms existing visual generation models and, for the first time, attains RL effectiveness on visual tokens comparable to that of large language models.
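The summary's central claim is that an AR token order turns image generation into a sequential decision process. In standard RL notation (our notation for illustration, not necessarily the paper's), take the state to be the token prefix and the action to be the next token; the value function under the AR token policy then satisfies the usual Bellman recursion:

```latex
% State: token prefix s_t = (x_1, \dots, x_t); action: next token x_{t+1}.
% Under a policy \pi (the AR token model), the value function satisfies
V^{\pi}(s_t) = \mathbb{E}_{x_{t+1} \sim \pi(\,\cdot \mid s_t)}
    \bigl[ r(s_t, x_{t+1}) + V^{\pi}(s_{t+1}) \bigr],
\qquad s_{t+1} = (x_1, \dots, x_{t+1}).
```

Roughly, the argument is that a causal token order gives each prefix a well-defined successor state, which is the Markov structure policy-gradient methods rely on, whereas spatial tokens do not decompose this way.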

📝 Abstract
We completely discard the conventional spatial prior in image representation and introduce a novel discrete visual tokenizer: the Self-consistency Tokenizer (Selftok). At its design core, we compose an autoregressive (AR) prior -- mirroring the causal structure of language -- into visual tokens by using the reverse diffusion process of image generation. The AR property makes Selftok fundamentally distinct from traditional spatial tokens in two key ways:

- Selftok offers an elegant and minimalist approach to unifying diffusion and AR for vision-language models (VLMs): by representing images with Selftok tokens, we can train a VLM using a purely discrete autoregressive architecture -- like that in LLMs -- without requiring additional modules or training objectives.
- We theoretically show that the AR prior satisfies the Bellman equation, whereas the spatial prior does not. Therefore, Selftok supports reinforcement learning (RL) for visual generation with effectiveness comparable to that achieved in LLMs.

Beyond the AR property, Selftok is also a SoTA tokenizer that achieves a favorable trade-off between high-quality reconstruction and compression rate. We use Selftok to build a pure AR VLM for both visual comprehension and generation tasks. Impressively, without using any text–image training pairs, a simple policy-gradient RL working on the visual tokens significantly boosts visual generation benchmarks, surpassing all existing models by a large margin. We therefore believe that Selftok effectively addresses the long-standing challenge that visual tokens cannot support effective RL. Combined with the well-established strengths of RL in LLMs, this brings us one step closer to realizing a truly multimodal LLM.

Project Page: https://selftok-team.github.io/report/
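To make the "simple policy gradient RL working on the visual tokens" concrete, here is a minimal REINFORCE sketch over discrete token sequences. This is a toy illustration, not the paper's method: the vocabulary size, sequence length, per-position logit table, and the `TARGET`-matching reward are all hypothetical stand-ins for a real tokenizer and reward model.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, SEQ_LEN = 8, 4          # toy token vocabulary and sequence length
TARGET = [3, 1, 4, 1]          # hypothetical "high-reward" token sequence

# One logit vector per position: a stand-in for an AR policy pi(token_t | prefix).
logits = np.zeros((SEQ_LEN, VOCAB))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_sequence():
    """Sample tokens position by position, keeping per-step log-probs."""
    toks, logps = [], []
    for t in range(SEQ_LEN):
        p = softmax(logits[t])
        tok = rng.choice(VOCAB, p=p)
        toks.append(int(tok))
        logps.append(np.log(p[tok]))
    return toks, logps

def reward(toks):
    """Sequence-level reward: fraction of positions matching TARGET."""
    return sum(int(a == b) for a, b in zip(toks, TARGET)) / SEQ_LEN

# Plain REINFORCE: ascend R * grad log pi(token_t | prefix) at every step.
lr = 0.5
for _ in range(300):
    toks, _ = sample_sequence()
    R = reward(toks)
    for t in range(SEQ_LEN):
        p = softmax(logits[t])
        grad = -p                  # d log p[tok] / d logits = one_hot(tok) - p
        grad[toks[t]] += 1.0
        logits[t] += lr * R * grad

# Greedy decode after training; the policy should lean toward TARGET.
greedy = [int(np.argmax(logits[t])) for t in range(SEQ_LEN)]
print(greedy)
```

The point of the sketch is structural: because the tokens are discrete and generated in a fixed causal order, the per-step log-probabilities needed for the policy-gradient update are directly available, which is exactly what spatial or continuous representations make awkward.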
Problem

Research questions and friction points this paper is trying to address.

Unifying diffusion and autoregression for vision-language models
Enabling reinforcement learning in visual token generation
Achieving high-quality image reconstruction and compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Self-consistency Tokenizer (Selftok) for images
Unifies diffusion and autoregressive prior in vision-language models
Enables effective reinforcement learning for visual generation
Authors

- Bohan Wang (Media Technology Institute, Huawei Singapore)
- Zhongqi Yue (Media Technology Institute, Huawei Singapore)
- Fengda Zhang (Media Technology Institute, Huawei Singapore)
- Shuo Chen (Media Technology Institute, Huawei Singapore)
- Li'an Bi (Media Technology Institute, Huawei Singapore)
- Junzhe Zhang (Syracuse University): Causal Inference, Artificial Intelligence
- Xue Song (Media Technology Institute, Huawei Singapore)
- Kennard Yanting Chan (Media Technology Institute, Huawei Singapore)
- Jiachun Pan (National University of Singapore): Information Theory, Deep Generative Models
- Weijia Wu (National University of Singapore; Zhejiang University): Video Generation, LLM, AIGC
- Mingze Zhou (Media Technology Institute, Huawei Singapore)
- Wang Lin (Zhejiang University): Computer Vision, Multi-Modal Learning, Video Understanding
- Kaihang Pan (Zhejiang University): NLP, Vision-and-Language
- Saining Zhang (College of Computing and Data Science, Nanyang Technological University): Computer Vision
- Liyu Jia (Nanyang Technological University)
- Wentao Hu (PhD student, The Hong Kong Polytechnic University): Large Language Model, Computer Vision
- Wei Zhao (Media Technology Institute, Huawei Singapore)
- Hanwang Zhang (Media Technology Institute, Huawei Singapore)