Towards Scalable Pre-training of Visual Tokenizers for Generation

📅 2025-12-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Pixel-level reconstruction paradigms (e.g., VAE-based visual tokenizers) bias latent spaces toward low-level features, so gains in pre-training accuracy fail to translate into improved generative quality—a scalability bottleneck the authors identify and formalize as the "pre-training scaling problem." Method: They propose the Visual Tokenizer Pretraining (VTP) framework, the first to shift visual tokenizer training objectives from pixel reconstruction to high-level semantic representation learning. VTP jointly optimizes three losses: image-text contrastive learning (CLIP-style), masked autoencoding (MAE-style), and variational reconstruction. Contribution/Results: On ImageNet, VTP achieves 78.2% zero-shot classification accuracy and 0.36 rFID. It accelerates generative convergence by 4.1× and improves downstream FID by 65.8% through increased pre-training compute alone—establishing the "understanding-driven generation" paradigm.
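The three-loss joint objective described above can be sketched in plain NumPy. The function names, loss weights, and toy tensor shapes below are illustrative assumptions for exposition, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def clip_contrastive(img_emb, txt_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE loss over a batch of paired embeddings."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (B, B) similarity matrix
    def xent_diag(l):
        # Cross-entropy with the matching pair (the diagonal) as the target.
        l = l - l.max(axis=1, keepdims=True)
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))

def mae_loss(pred, target, mask):
    """MAE-style loss: mean squared error computed only on masked patches."""
    per_patch = ((pred - target) ** 2).mean(axis=-1)  # (B, N)
    return (per_patch * mask).sum() / mask.sum()

def variational_recon(recon, target, mu, logvar, beta=1e-4):
    """Pixel reconstruction plus a KL penalty towards a unit Gaussian prior."""
    mse = ((recon - target) ** 2).mean()
    kl = 0.5 * (np.exp(logvar) + mu ** 2 - 1.0 - logvar).mean()
    return mse + beta * kl

def vtp_objective(losses, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three losses; equal weights here are illustrative."""
    return sum(w * l for w, l in zip(weights, losses))

# Toy batch: 4 image/text pairs, 8-dim embeddings, 16 patches of dim 32.
img_e, txt_e = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
pred, tgt = rng.normal(size=(4, 16, 32)), rng.normal(size=(4, 16, 32))
mask = (rng.random((4, 16)) < 0.75).astype(float)  # ~75% of patches masked
recon, x = rng.normal(size=(4, 64)), rng.normal(size=(4, 64))
mu, logvar = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))

total = vtp_objective([
    clip_contrastive(img_e, txt_e),
    mae_loss(pred, tgt, mask),
    variational_recon(recon, x, mu, logvar),
])
print(float(total))
```

In practice each term would be computed from a shared encoder's latents rather than independent random tensors; the sketch only shows how the three objectives combine into one scalar loss.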

📝 Abstract
The quality of the latent space in visual tokenizers (e.g., VAEs) is crucial for modern generative models. However, the standard reconstruction-based training paradigm produces a latent space biased towards low-level information, leading to a foundational flaw: better pixel-level accuracy does not lead to higher-quality generation. As a result, pouring extensive compute into visual tokenizer pre-training translates poorly into improved generative performance. We identify this as the "pre-training scaling problem" and argue for a necessary shift: to be effective for generation, a latent space must concisely represent high-level semantics. We present VTP, a unified visual tokenizer pre-training framework that pioneers the joint optimization of image-text contrastive, self-supervised, and reconstruction losses. Our large-scale study yields two principal findings: (1) understanding is a key driver of generation, and (2) VTP exhibits much better scaling properties, with generative performance scaling effectively with the compute, parameters, and data allocated to tokenizer pre-training. After large-scale pre-training, our tokenizer delivers a competitive profile (78.2% zero-shot accuracy and 0.36 rFID on ImageNet) and 4.1× faster convergence on generation compared to advanced distillation methods. More importantly, it scales effectively: without modifying the standard DiT training recipe, solely investing more FLOPs in pre-training VTP achieves a 65.8% FID improvement in downstream generation, while a conventional autoencoder stagnates very early, at one-tenth of the FLOPs. Our pre-trained models are available at https://github.com/MiniMax-AI/VTP.
Problem

Research questions and friction points this paper is trying to address.

Standard visual tokenizer training prioritizes low-level pixel accuracy over high-level semantic understanding
Current reconstruction-based methods scale poorly with increased computational resources for generation tasks
Better pixel reconstruction does not translate to improved generative model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint optimization of image-text contrastive, self-supervised (masked autoencoding), and variational reconstruction losses in a single visual tokenizer
Generative performance that scales with the compute, parameters, and data invested in tokenizer pre-training
4.1× faster generative convergence than advanced distillation methods and a 65.8% downstream FID improvement over a conventional autoencoder