🤖 AI Summary
Existing vision tokenizers are typically trained from scratch, struggling to jointly optimize semantic representation in high-dimensional latent spaces and reconstruction fidelity. This paper proposes DINO-Tok, the first tokenizer that systematically leverages multi-level features from a pre-trained DINO model for vision tokenization: it fuses low-level details and high-level semantics to construct information-rich hierarchical representations; introduces a global PCA-driven channel reweighting mechanism to mitigate information loss and codebook collapse in vector quantization (VQ); and optimizes the VQ process via feature concatenation and dimensionality reduction. Evaluated on ImageNet at 256×256 resolution, DINO-Tok achieves 28.54 PSNR (autoencoding) and 23.98 PSNR (VQ), outperforming mainstream tokenizers and matching the performance of models trained on billion-scale datasets. By effectively bridging pixel-level fidelity and semantic understanding, DINO-Tok enables high-fidelity generative modeling.
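The fusion step described above (concatenating shallow detail features with deep semantic features, then reducing the channel dimension) can be sketched as follows. This is a minimal illustration with assumed shapes and a random projection standing in for a learned one; the variable names (`shallow`, `deep`, `W`) and the latent width of 256 are hypothetical, not taken from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical multi-level features from a frozen DINO encoder:
# 14x14 = 196 patch tokens, 768 channels (ViT-B-like dimensions).
N, C = 196, 768
shallow = rng.standard_normal((N, C))  # early block: fine-grained detail
deep = rng.standard_normal((N, C))     # final block: global semantics

# Fuse by channel concatenation, then reduce to a target latent width.
# In the actual tokenizer this projection would be learned, not random.
fused = np.concatenate([shallow, deep], axis=1)        # (N, 2C)
W = rng.standard_normal((2 * C, 256)) / np.sqrt(2 * C)  # placeholder projection
latent = fused @ W                                      # (N, 256)
print(latent.shape)  # (196, 256)
```

The point of the sketch is only the data flow: both feature levels contribute to every latent channel, so the quantizer downstream sees a representation that mixes pixel-level detail with semantics.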
📝 Abstract
Recent advances in visual generation have highlighted the rise of Latent Generative Models (LGMs), which rely on effective visual tokenizers to bridge pixels and semantics. However, existing tokenizers are typically trained from scratch and struggle to balance semantic representation and reconstruction fidelity, particularly in high-dimensional latent spaces. In this work, we introduce DINO-Tok, a DINO-based visual tokenizer that unifies hierarchical representations into an information-complete latent space. By integrating shallow features that retain fine-grained details with deep features encoding global semantics, DINO-Tok effectively bridges pretrained representations and visual generation. We further analyze the challenges of vector quantization (VQ) in this high-dimensional space, where key information is often lost and codebook collapse occurs. We thus propose a global PCA reweighting mechanism to stabilize VQ and preserve essential information across dimensions. On ImageNet 256×256, DINO-Tok achieves state-of-the-art reconstruction performance, reaching 28.54 PSNR for autoencoding and 23.98 PSNR for VQ-based modeling, significantly outperforming prior tokenizers and matching models trained on billion-scale data (such as Hunyuan and Wan). These results demonstrate that adapting powerful pretrained vision models like DINO for tokenization yields semantically aligned and high-fidelity latent representations, enabling next-generation visual generative models. Code will be publicly available at https://github.com/MKJia/DINO-Tok.
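The global PCA reweighting idea can be illustrated with a small numpy sketch: compute principal directions over a batch of latent vectors, then rescale each principal channel so that no single direction dominates the nearest-neighbour distance used in codebook lookup. The equal-variance rescaling rule below is an illustrative choice, not necessarily the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((4096, 256))  # latent vectors pooled over a batch

# Global PCA: principal directions (rows of Vt) and per-direction variances.
Zc = Z - Z.mean(axis=0)
U, S, Vt = np.linalg.svd(Zc, full_matrices=False)
var = (S ** 2) / (Z.shape[0] - 1)

# Reweight each principal channel toward equal variance, so high-variance
# directions cannot swamp the VQ distance computation and low-variance
# (but informative) directions are not ignored.
weights = np.sqrt(var.mean() / var)   # illustrative rule, assumed here
Z_reweighted = (Zc @ Vt.T) * weights  # project onto PCA basis, rescale

# After reweighting, every principal channel carries the same variance.
print(np.allclose(Z_reweighted.var(axis=0, ddof=1), var.mean()))
```

A quantizer operating on `Z_reweighted` compares codes in a space where all dimensions contribute comparably, which is one plausible mechanism for avoiding the codebook collapse the abstract describes.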