WeTok: Powerful Discrete Tokenization for High-Fidelity Visual Reconstruction

📅 2025-08-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing visual tokenizers struggle to achieve high compression ratios and high-fidelity reconstruction at the same time. This paper proposes WeTok, a visual tokenizer that combines Grouped Quantization without Lookup (GQ) and Generative Decoding (GD) to enable scalable codebook learning and probabilistic modeling of visual distributions while remaining computationally efficient. GQ improves quantization accuracy and scalability by partitioning latent features into groups and quantizing each group jointly without a lookup table; GD uses a noise prior to reconstruct high-quality images from discrete tokens. On the ImageNet 50k validation set, WeTok reaches a zero-shot rFID of 3.49 at a compression ratio of 768, substantially outperforming state-of-the-art methods, and its best zero-shot rFID of 0.12 sets a new record. To our knowledge, this is the first work to jointly employ lookup-free quantization and generative decoding for visual tokenization, advancing compression efficiency, reconstruction quality, and representational modeling capability together.
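The grouped, lookup-free quantization step can be sketched as follows. This is a minimal illustration in the spirit of lookup-free quantization with per-channel sign binarization; the function name and the choice of binary codes are assumptions, and WeTok's actual GQ may differ in details such as scaling factors and training losses.

```python
import numpy as np

def grouped_lookup_free_quantize(z, num_groups):
    """Sketch of group-wise lookup-free quantization (GQ).

    z: (N, C) latent features, with C divisible by num_groups.
    Each group of C // num_groups channels is binarized to {-1, +1}
    by sign, so a group's token index is simply the binary number
    formed by its bits; no codebook lookup or nearest-neighbor search
    is needed, and the effective per-group codebook size 2**(C//G)
    scales exponentially with the group width.
    Returns (quantized, indices) with indices of shape (N, num_groups).
    """
    n, c = z.shape
    assert c % num_groups == 0, "channels must split evenly into groups"
    d = c // num_groups
    zg = z.reshape(n, num_groups, d)
    q = np.where(zg >= 0, 1.0, -1.0)          # lookup-free binarization
    bits = (q > 0).astype(np.int64)           # {-1,+1} -> {0,1}
    weights = 2 ** np.arange(d)               # binary positional weights
    indices = (bits * weights).sum(axis=-1)   # one token id per group
    return q.reshape(n, c), indices

# Example: 6 channels split into 2 groups of 3 -> token ids in [0, 8)
z = np.array([[0.3, -1.2, 0.7, 0.1, -0.5, 2.0]])
q, idx = grouped_lookup_free_quantize(z, num_groups=2)
print(idx)  # [[5 5]]
```

Because the index is computed directly from the signs, memory no longer grows with codebook size, which is what makes very large codebooks tractable.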

📝 Abstract
The visual tokenizer is a critical component for vision generation. However, existing tokenizers often face an unsatisfactory trade-off between compression ratio and reconstruction fidelity. To fill this gap, we introduce WeTok, a powerful and concise tokenizer that surpasses previous leading tokenizers via two core innovations. (1) Group-wise lookup-free Quantization (GQ). We partition the latent features into groups and perform lookup-free quantization for each group. As a result, GQ efficiently overcomes the memory and computation limitations of prior tokenizers while achieving a reconstruction breakthrough with more scalable codebooks. (2) Generative Decoding (GD). Unlike prior tokenizers, we introduce a generative decoder with a prior over an extra noise variable. GD can thus probabilistically model the distribution of visual data conditioned on discrete tokens, allowing WeTok to reconstruct visual details, especially at high compression ratios. Extensive experiments on mainstream benchmarks show the superior performance of WeTok. On the ImageNet 50k validation set, WeTok achieves a record-low zero-shot rFID (WeTok: 0.12 vs. FLUX-VAE: 0.18 vs. SD-VAE 3.5: 0.19). Furthermore, our highest-compression model achieves a zero-shot rFID of 3.49 at a compression ratio of 768, outperforming Cosmos (rFID 4.57 at a compression ratio of 384), which has only half of our compression rate. Code and models are available at: https://github.com/zhuangshaobin/WeTok.
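As a concrete reading of the abstract's numbers, one common definition of a tokenizer's compression ratio is input bits divided by token bits. The configuration below (resolution, downsampling factor, bits per token) is an assumed example chosen to reproduce a ratio of 768, not the authors' stated setup.

```python
# Hypothetical configuration; illustrates the ratio arithmetic only.
H = W = 256                          # input resolution (assumed)
input_bits = H * W * 3 * 8           # RGB image, 8 bits per channel
f = 32                               # spatial downsampling factor (assumed)
num_tokens = (H // f) * (W // f)     # 8 x 8 = 64 discrete tokens
bits_per_token = 32                  # e.g. 32 binary code dims (assumed)
ratio = input_bits / (num_tokens * bits_per_token)
print(ratio)  # 768.0
```

Under this definition, Cosmos's ratio of 384 corresponds to spending twice as many bits on the same input, which is why the abstract calls it half of WeTok's compression rate.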
Problem

Research questions and friction points this paper is trying to address.

Improves trade-off between compression and reconstruction fidelity
Introduces group-wise lookup-free quantization for efficiency
Uses generative decoding to model visual data distribution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Group-wise lookup-free quantization for scalable codebooks
Generative decoding with extra noise variable
High-fidelity reconstruction at high compression ratios
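A toy sketch of the generative-decoding idea listed above: the decoder consumes the discrete tokens' embedding together with a sampled noise variable, so it defines a distribution over reconstructions rather than a single deterministic output. The linear map, noise scale, and function name below are illustrative stand-ins, not WeTok's actual architecture.

```python
import numpy as np

def generative_decode(token_embedding, decoder_weight, noise_scale=0.1, seed=None):
    """Toy generative decoder: the output depends on the token embedding
    AND a noise prior eps ~ N(0, I), so sampling different eps yields
    different plausible reconstructions for the same discrete tokens."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(token_embedding.shape)   # noise prior
    return (token_embedding + noise_scale * eps) @ decoder_weight

emb = np.ones((1, 4))        # stand-in embedding of a token sequence
W_dec = np.eye(4)            # stand-in "decoder" weights
a = generative_decode(emb, W_dec, seed=0)
b = generative_decode(emb, W_dec, seed=1)
# same tokens, different noise samples -> different reconstructions
```

At high compression ratios many details are not recoverable from the tokens alone, and this stochastic formulation lets the decoder synthesize plausible detail instead of averaging over it.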
👥 Authors
Shaobin Zhuang (Shanghai Jiaotong University)
Yiwei Guo (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences)
Canmiao Fu (WeChat Vision, Tencent Inc.)
Zhipeng Huang (Microsoft Research Asia and University of Science and Technology of China)
Zeyue Tian (Hong Kong University of Science and Technology)
Ying Zhang (WeChat Vision, Tencent Inc.)
Chen Li (WeChat Vision, Tencent Inc.)
Yali Wang (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Shanghai AI Laboratory)