🤖 AI Summary
Existing visual tokenizers struggle to achieve high compression ratios and high-fidelity reconstruction at the same time. This paper proposes WeTok, a visual tokenizer that combines Group-wise lookup-free Quantization (GQ) and Generative Decoding (GD) to enable scalable codebook learning and probabilistic modeling of visual distributions while remaining computationally efficient. GQ partitions latent features into groups and quantizes each group jointly without a lookup table, improving quantization accuracy and scalability; GD leverages a noise prior to reconstruct high-quality images from discrete tokens. On the ImageNet 50k validation set, WeTok reaches a compression ratio of 768 with a zero-shot rFID of 3.49, substantially outperforming state-of-the-art methods, and its best zero-shot rFID of 0.12 sets a new record. To the authors' knowledge, this is the first work to jointly employ lookup-free quantization and generative decoding for visual tokenization, delivering unified advances in compression efficiency, reconstruction quality, and representational modeling capability.
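The grouped, lookup-free quantization idea can be illustrated with a minimal sketch. This is not WeTok's actual implementation; it assumes the sign-based binarization used in lookup-free quantization, where each latent dimension is quantized to ±1 and each group's sign bits form a binary token index (so a group of d dimensions gives an implicit codebook of size 2^d with no stored lookup table):

```python
def groupwise_lfq(z, num_groups):
    """Sketch of group-wise lookup-free quantization.

    z: flat list of latent channels; the channels are split into
    num_groups contiguous groups, and each dimension is binarized
    to +1/-1 by sign -- no codebook lookup is performed.
    Returns the quantized vector and one token index per group.
    (Illustrative only; names and binarization rule are assumptions.)
    """
    assert len(z) % num_groups == 0
    d = len(z) // num_groups
    quantized, tokens = [], []
    for g in range(num_groups):
        group = z[g * d:(g + 1) * d]
        bits = [1 if v >= 0 else 0 for v in group]
        quantized.extend(1.0 if b else -1.0 for b in bits)
        # token index: read the group's sign bits as a binary number,
        # indexing an implicit codebook of size 2**d
        tokens.append(sum(b << i for i, b in enumerate(bits)))
    return quantized, tokens
```

Grouping is what makes the codebook scalable: total capacity grows as (2^d)^num_groups while the per-group quantization stays cheap, since no distance computation against stored codewords is needed.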
📝 Abstract
The visual tokenizer is a critical component for vision generation. However, existing tokenizers often face an unsatisfactory trade-off between compression ratio and reconstruction fidelity. To fill this gap, we introduce WeTok, a powerful and concise tokenizer that surpasses previous leading tokenizers via two core innovations. (1) Group-wise lookup-free Quantization (GQ). We partition the latent features into groups and perform lookup-free quantization for each group. As a result, GQ efficiently overcomes the memory and computation limitations of prior tokenizers while achieving a reconstruction breakthrough with more scalable codebooks. (2) Generative Decoding (GD). Unlike prior tokenizers, we introduce a generative decoder with a prior over an extra noise variable. GD can thus probabilistically model the distribution of visual data conditioned on discrete tokens, allowing WeTok to reconstruct visual details, especially at high compression ratios. Extensive experiments on mainstream benchmarks show the superior performance of WeTok. On the ImageNet 50k validation set, WeTok achieves a record-low zero-shot rFID (WeTok: 0.12 vs. FLUX-VAE: 0.18 vs. SD-VAE 3.5: 0.19). Furthermore, our highest-compression model achieves a zero-shot rFID of 3.49 at a compression ratio of 768, outperforming Cosmos (rFID 4.57 at a compression ratio of 384, only half of ours). Code and models are available at https://github.com/zhuangshaobin/WeTok.
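The generative-decoding idea, reconstructing from discrete tokens plus a sampled noise variable, can be sketched as follows. Everything here is illustrative: the linear "decode" is a stand-in for WeTok's learned decoder network, and the function name and shapes are assumptions, not the paper's API. The point is only that conditioning on noise lets the decoder represent a distribution p(x | tokens) rather than a single deterministic output:

```python
import random

def generative_decode(cond, noise_dim=3, seed=None):
    """Sketch of generative decoding (GD).

    cond: embedding of the discrete tokens (the conditioning signal).
    A noise variable z ~ N(0, I) is sampled and decoded together with
    the conditioning, so different noise draws yield different,
    equally plausible reconstructions of the lost visual detail.
    The mixing below is a placeholder for a trained decoder.
    """
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    # placeholder decode: combine token conditioning with noise; a
    # learned network would map (cond, z) to an image
    return [c + 0.1 * z[i % noise_dim] for i, c in enumerate(cond)]
```

At high compression ratios the tokens underdetermine the image, which is exactly when sampling from p(x | tokens) instead of regressing a single output helps recover plausible detail.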