GloTok: Global Perspective Tokenizer for Image Reconstruction and Generation

📅 2025-11-18
🤖 AI Summary
Existing image tokenization methods rely on local supervision, which leads to an uneven semantic distribution and limits reconstruction and autoregressive generation performance. To address this, we propose GloTok, the first approach to guide codebook learning via global semantic relationships. GloTok enforces semantic uniformity through codebook-level histogram relationship modeling and introduces a residual learning module to recover the fine-grained details lost to quantization. Because global semantics are distilled into the tokenizer itself, downstream autoregressive models can be trained without direct access to pre-trained vision models. On ImageNet-1k, GloTok achieves state-of-the-art performance in both image reconstruction and autoregressive generation, outperforming mainstream locally supervised methods.

📝 Abstract
Existing state-of-the-art image tokenization methods leverage diverse semantic features from pre-trained vision models as additional supervision to expand the distribution of latent representations and thereby improve the quality of image reconstruction and generation. These methods supervise semantics locally, which limits the uniformity of the semantic distribution; however, VA-VAE shows that a more uniform feature distribution yields better generation performance. In this work, we introduce the Global Perspective Tokenizer (GloTok), which uses global relational information to model a more uniform semantic distribution of tokenized features. Specifically, a codebook-wise histogram relation learning method transfers the semantics, modeled by pre-trained models over the entire dataset, to the semantic codebook. We then design a residual learning module that recovers fine-grained details to minimize the reconstruction error caused by quantization. Through this design, GloTok delivers more uniformly distributed semantic latent representations, which facilitates the training of autoregressive (AR) models for generating high-quality images without requiring direct access to pre-trained models during training. Experiments on the standard ImageNet-1k benchmark show that our method achieves state-of-the-art reconstruction performance and generation quality.
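The paper does not spell out the histogram relation objective, but the idea of transferring global, dataset-level similarity structure into a codebook can be sketched as follows. This is a minimal illustration, not the authors' implementation: `similarity_histogram` and `histogram_relation_loss` are hypothetical names, and the choice of cosine similarity, bin count, and KL divergence are assumptions.

```python
import numpy as np

def cosine_sim_matrix(x):
    # Pairwise cosine similarities between rows of x.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T

def similarity_histogram(feats, bins=32):
    # Histogram of off-diagonal pairwise similarities, normalised to a
    # probability distribution. This summarises the *global* relational
    # structure of a feature set, independent of its dimensionality.
    sim = cosine_sim_matrix(feats)
    off_diag = sim[~np.eye(len(feats), dtype=bool)]
    hist, _ = np.histogram(off_diag, bins=bins, range=(-1.0, 1.0))
    return hist / hist.sum()

def histogram_relation_loss(codebook, teacher_feats, bins=32, eps=1e-8):
    # KL divergence between the codebook's similarity histogram and the
    # teacher's; minimising it pulls the codebook's global similarity
    # structure toward that of the pre-trained model's features.
    p = similarity_histogram(teacher_feats, bins) + eps
    q = similarity_histogram(codebook, bins) + eps
    return float(np.sum(p * np.log(p / q)))
```

Because only the histogram of similarities is matched, the codebook and teacher features may live in different dimensions, and the teacher is only needed while fitting the tokenizer, consistent with the abstract's claim that AR training requires no pre-trained model.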
Problem

Research questions and friction points this paper is trying to address.

Improving image tokenization uniformity through global semantic modeling
Enhancing image reconstruction quality with residual detail recovery
Enabling high-quality image generation without pre-trained model dependencies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes global relational information for uniform token distribution
Implements codebook-wise histogram relation learning for semantic transfer
Employs residual learning module to recover fine-grained details
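The residual learning bullet above can be illustrated with a toy vector-quantization step followed by a learned recovery branch. This is a hedged sketch under assumptions: the paper does not describe its module's architecture, and `recover` here is a placeholder callable standing in for whatever network predicts the fine-grained detail.

```python
import numpy as np

def vector_quantize(z, codebook):
    # Standard nearest-codeword assignment: each latent row of z is
    # replaced by its closest codebook entry (squared Euclidean distance).
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(1)
    return codebook[idx], idx

def residual_recover(z, codebook, recover):
    # Quantize, then add a predicted residual back onto the quantized
    # code, so the decoder sees q + recover(q) instead of the raw q.
    # `recover` is a stand-in for the paper's residual learning module.
    q, idx = vector_quantize(z, codebook)
    return q + recover(q), idx
```

With `recover = lambda q: np.zeros_like(q)` this reduces to plain VQ; a trained module would instead shrink the gap between `z` and the decoder's input, which is the quantization error the paper targets.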
👥 Authors
Xuan Zhao, PhD, Forschungszentrum Jülich GmbH (XAI, Fair AI)
Zhongyu Zhang, Tencent Youtu Lab, Shanghai, China
Yuge Huang, Tencent Youtu Lab, Shanghai, China
Yuxi Mi, Fudan University (Face Recognition, Privacy, Biometrics, Computer Vision)
Guodong Mu, Tencent Youtu Lab, Shanghai, China
Shouhong Ding, Tencent Youtu Lab, Shanghai, China
Jun Wang, Tencent WeChat Pay Lab, Shanghai, China
Rizen Guo, Tencent WeChat Pay Lab, Shanghai, China
Shuigeng Zhou, Fudan University (Database, Bioinformatics, Machine Learning)