SFTok: Bridging the Performance Gap in Discrete Tokenizers

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Discrete tokenizers underperform continuous alternatives in multimodal modeling due to poor reconstruction fidelity and training-inference inconsistency, especially at the high compression ratios (e.g., 64 tokens per image) required for high-resolution image generation. Method: a multi-step iterative reconstruction mechanism that mitigates distribution shift and error accumulation during discretization via self-forcing guided visual reconstruction and a debias-and-fitting training strategy, enforcing consistency between the training and inference stages. Contribution/Results: On ImageNet, SFTok achieves a state-of-the-art rFID of 1.21 and a class-to-image generation gFID of 2.29, significantly narrowing the performance gap between discrete and continuous tokenizers and pointing toward efficient, high-fidelity multimodal modeling with discrete representations.

📝 Abstract
Recent advances in multimodal models highlight the pivotal role of image tokenization in high-resolution image generation. By compressing images into compact latent representations, tokenizers enable generative models to operate in lower-dimensional spaces, thereby improving computational efficiency and reducing complexity. Discrete tokenizers naturally align with the autoregressive paradigm but still lag behind continuous ones, limiting their adoption in multimodal systems. To address this, we propose SFTok, a discrete tokenizer that incorporates a multi-step iterative mechanism for precise reconstruction. By integrating self-forcing guided visual reconstruction and a debias-and-fitting training strategy, SFTok resolves the training-inference inconsistency in the multi-step process, significantly enhancing image reconstruction quality. At a high compression rate of only 64 tokens per image, SFTok achieves state-of-the-art reconstruction quality on ImageNet (rFID = 1.21) and demonstrates exceptional performance in class-to-image generation tasks (gFID = 2.29).
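The core idea in the abstract, a multi-step reconstruction rollout where each step conditions on the model's own previous output so that training and inference see the same trajectory, can be sketched in a few lines. This is a minimal illustrative toy, not the paper's implementation: the codebook, the linear refinement step, and all shapes below are hypothetical stand-ins for SFTok's actual VQ-VAE components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the paper's components: a frozen VQ codebook
# and a linear refinement map (real SFTok uses learned neural modules).
CODEBOOK = rng.normal(size=(16, 8))      # 16 codes, 8-dim embeddings
W_REFINE = rng.normal(size=(8, 8)) * 0.1

def quantize(z):
    """Standard VQ step: nearest-neighbour lookup into the codebook."""
    dists = ((z[None, :] - CODEBOOK) ** 2).sum(axis=1)
    return CODEBOOK[dists.argmin()]

def refine(recon, codes):
    """One refinement step, conditioned on codes and the current recon."""
    return recon + 0.5 * (codes - recon) @ W_REFINE

def iterative_reconstruct(z, n_steps=4):
    """Multi-step reconstruction. Each step takes the model's OWN previous
    output as input (self-forcing), so the rollout used during training is
    identical to the one used at inference -- no train/test mismatch."""
    codes = quantize(z)
    recon = np.zeros_like(codes)
    for _ in range(n_steps):
        recon = refine(recon, codes)
    return recon

recon = iterative_reconstruct(rng.normal(size=8))
```

In actual training the reconstruction loss would be computed on `recon` at the end of this rollout, so gradients flow through the model-conditioned steps rather than through ground-truth-conditioned (teacher-forced) steps; that is the training-inference consistency the abstract refers to.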
Problem

Research questions and friction points this paper is trying to address.

Improves discrete tokenizers for high-resolution image generation
Resolves training-inference inconsistency in multi-step reconstruction
Enhances image quality at high compression rates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-step iterative mechanism for precise reconstruction
Self-forcing guided visual reconstruction strategy
Debias-and-fitting training to resolve inconsistency
Qihang Rao
Department of Automation, Tsinghua University, China
Borui Zhang
Ph.D. student, Tsinghua University
Computer Vision · Machine Learning · Metric Learning · Explainable AI
Wenzhao Zheng
EECS, University of California, Berkeley
Large Models · Embodied Agents · Autonomous Driving
Jie Zhou
Department of Automation, Tsinghua University, China
Jiwen Lu
Department of Automation, Tsinghua University, China