WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing unified visual tokenizers struggle to deliver both high-level semantic abstraction and low-level pixel reconstruction, because a single token space must serve the conflicting objectives of understanding and generation. This work proposes WinTok, a lightweight hybrid tokenizer that explicitly decouples the two objectives. It supplements pixel tokens with a set of learnable semantic tokens, which mitigates cross-task interference without the computational overhead of dual tokenizers, and it transfers discriminative knowledge from a pretrained visual foundation model to the semantic tokens through an asymmetric token distillation mechanism. Trained on only 50M publicly available images, WinTok reports consistent improvements in reconstruction, understanding, and generation across 10 benchmarks, surpassing UniTok by 11.2% in classification accuracy and reaching a competitive reconstruction rFID of 0.41.

📝 Abstract

Building a unified visual tokenizer is essential for bridging the gap between visual understanding and generation. Yet existing approaches struggle with the inherent conflict between these tasks, as a single token space is forced to support both high-level semantic abstraction and low-level pixel reconstruction. We propose WinTok, a concise hybrid tokenizer that achieves win-win performance by explicitly decoupling the two objectives. WinTok supplements pixel tokens with a set of learnable semantic tokens, effectively mitigating cross-task interference without incurring the computational overhead of dual tokenizers. To further enhance understanding capability, we introduce an asymmetric token distillation mechanism: the semantic tokens are guided by pretrained semantic embeddings from any visual foundation model, enabling them to inherit strong discriminative power while maintaining flexibility. Across 10 challenging benchmarks, WinTok delivers consistent improvements in reconstruction, understanding, and generation. Trained on only 50M open-source data samples, WinTok surpasses the strong baseline UniTok by 11.2% in classification accuracy and achieves a competitive reconstruction rFID of 0.41, despite using substantially less training data. Code is released at https://github.com/markywg/WinTok.
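
To make the hybrid token structure and the distillation idea more concrete, here is a minimal NumPy sketch. It is not taken from the WinTok code release, and everything in it is an assumption made for illustration: the function names `build_hybrid_sequence` and `semantic_distillation_loss`, the linear projection `proj`, the cosine-distance loss, and the premise that the teacher supplies one embedding per semantic token. The encoder that would process the hybrid sequence is omitted. "Asymmetric" is read here as "only the semantic tokens receive teacher guidance", which is one plausible interpretation of the abstract, not a confirmed detail.

```python
import numpy as np

def build_hybrid_sequence(pixel_tokens, semantic_tokens):
    """Append shared learnable semantic tokens to each image's pixel tokens.

    pixel_tokens:    (batch, n_pixel, dim)
    semantic_tokens: (n_semantic, dim), shared across the batch (assumption)
    returns:         (batch, n_pixel + n_semantic, dim)
    """
    batch = pixel_tokens.shape[0]
    sem = np.broadcast_to(semantic_tokens, (batch,) + semantic_tokens.shape)
    return np.concatenate([pixel_tokens, sem], axis=1)

def semantic_distillation_loss(hybrid_out, teacher_emb, n_semantic, proj):
    """Mean cosine distance between projected semantic tokens and teacher embeddings.

    Only the last n_semantic positions are supervised, so the pixel tokens
    are left free for reconstruction (hypothetical reading of "asymmetric").
    """
    student = hybrid_out[:, -n_semantic:, :] @ proj  # (batch, n_semantic, teacher_dim)
    s = student / (np.linalg.norm(student, axis=-1, keepdims=True) + 1e-8)
    t = teacher_emb / (np.linalg.norm(teacher_emb, axis=-1, keepdims=True) + 1e-8)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

# Toy usage with random data; all shapes are illustrative only.
rng = np.random.default_rng(0)
pixel = rng.normal(size=(2, 16, 8))      # 2 images, 16 pixel tokens, dim 8
semantic = rng.normal(size=(4, 8))       # 4 learnable semantic tokens
proj = rng.normal(size=(8, 12))          # maps token dim to teacher dim
teacher = rng.normal(size=(2, 4, 12))    # stand-in for foundation-model embeddings

seq = build_hybrid_sequence(pixel, semantic)
loss = semantic_distillation_loss(seq, teacher, n_semantic=4, proj=proj)
```

In this sketch the loss depends only on the semantic positions, so changing the pixel tokens leaves it unchanged. That is the sense in which the two objectives are kept apart here.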

Problem

Research questions and friction points this paper is trying to address.

visual tokenizer

visual understanding

visual generation

token space conflict

semantic abstraction

Innovation

Methods, ideas, or system contributions that make the work stand out.

hybrid tokenizer

task decoupling

semantic tokens