SFTok: Bridging the Performance Gap in Discrete Tokenizers

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Discrete tokenizers underperform continuous alternatives in multimodal modeling due to poor reconstruction fidelity and training-inference inconsistency, especially at the high compression ratios (e.g., 64 tokens per image) required for high-resolution image generation. Method: a multi-step iterative reconstruction mechanism that mitigates distribution shift and error accumulation during discretization via self-forcing guided visual reconstruction and a debias-and-fitting training strategy, enforcing consistency between the training and inference stages. Contribution/Results: On ImageNet, SFTok achieves a state-of-the-art rFID of 1.21 and a class-to-image generation gFID of 2.29, significantly narrowing the performance gap between discrete and continuous tokenizers and pointing toward efficient, high-fidelity multimodal modeling with discrete representations.

📝 Abstract
Recent advances in multimodal models highlight the pivotal role of image tokenization in high-resolution image generation. By compressing images into compact latent representations, tokenizers enable generative models to operate in lower-dimensional spaces, thereby improving computational efficiency and reducing complexity. Discrete tokenizers naturally align with the autoregressive paradigm but still lag behind continuous ones, limiting their adoption in multimodal systems. To address this, we propose SFTok, a discrete tokenizer that incorporates a multi-step iterative mechanism for precise reconstruction. By integrating self-forcing guided visual reconstruction and a debias-and-fitting training strategy, SFTok resolves the training-inference inconsistency in the multi-step process, significantly enhancing image reconstruction quality. At a high compression rate of only 64 tokens per image, SFTok achieves state-of-the-art reconstruction quality on ImageNet (rFID = 1.21) and demonstrates exceptional performance in class-to-image generation tasks (gFID = 2.29).
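The core idea in the abstract, a multi-step reconstruction rollout where each step conditions on the model's own previous output so that training and inference see the same trajectory, can be sketched in a few lines. This is a minimal illustrative toy, not the paper's implementation: the codebook, the linear refinement step, and all shapes below are hypothetical stand-ins for SFTok's actual VQ-VAE components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the paper's components: a frozen VQ codebook
# and a linear refinement map (real SFTok uses learned neural modules).
CODEBOOK = rng.normal(size=(16, 8))      # 16 codes, 8-dim embeddings
W_REFINE = rng.normal(size=(8, 8)) * 0.1

def quantize(z):
    """Standard VQ step: nearest-neighbour lookup into the codebook."""
    dists = ((z[None, :] - CODEBOOK) ** 2).sum(axis=1)
    return CODEBOOK[dists.argmin()]

def refine(recon, codes):
    """One refinement step, conditioned on codes and the current recon."""
    return recon + 0.5 * (codes - recon) @ W_REFINE

def iterative_reconstruct(z, n_steps=4):
    """Multi-step reconstruction. Each step takes the model's OWN previous
    output as input (self-forcing), so the rollout used during training is
    identical to the one used at inference -- no train/test mismatch."""
    codes = quantize(z)
    recon = np.zeros_like(codes)
    for _ in range(n_steps):
        recon = refine(recon, codes)
    return recon

recon = iterative_reconstruct(rng.normal(size=8))
```

In actual training the reconstruction loss would be computed on `recon` at the end of this rollout, so gradients flow through the model-conditioned steps rather than through ground-truth-conditioned (teacher-forced) steps; that is the training-inference consistency the abstract refers to.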
Problem

Research questions and friction points this paper is trying to address.

Improves discrete tokenizers for high-resolution image generation
Resolves training-inference inconsistency in multi-step reconstruction
Enhances image quality at high compression rates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-step iterative mechanism for precise reconstruction
Self-forcing guided visual reconstruction strategy
Debias-and-fitting training to resolve inconsistency
Qihang Rao
Department of Automation, Tsinghua University, China
Borui Zhang
Ph.D. student, Tsinghua University
Computer Vision · Machine Learning · Metric Learning · Explainable AI
Wenzhao Zheng
EECS, University of California, Berkeley
Large Models · Embodied Agents · Autonomous Driving
Jie Zhou
Department of Automation, Tsinghua University, China
Jiwen Lu
Department of Automation, Tsinghua University, China