Mutual Enhancement Between Global Tokens and Patch Tokens: From Theory to Practice

📅 2026-05-11
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the limitations of existing image tokenization methods that employ fixed compression ratios, which ignore spatial variations in local information density and often result in either redundancy or information loss. The authors propose TaTok, an adaptive image tokenization framework grounded in information entropy theory. They theoretically demonstrate, for the first time, that relying solely on local image patches leads to insufficient information representation and inter-patch redundancy. To overcome this, TaTok introduces global tokens to model inter-patch mutual information and incorporates a Dynamic Token Filtering (DTF) strategy based on cumulative conditional entropy, enabling adaptive token allocation according to local information richness. Through a mutual enhancement mechanism between global and local tokens, TaTok achieves superior efficiency without compromising reconstruction quality, yielding a 1.3× improvement in gFID and an 8.7× inference speedup over baseline methods, thereby striking a better trade-off between compression ratio and reconstruction fidelity.
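The redundancy claim can be read against a standard identity from information theory. The notation below is an assumption for illustration (patch tokens written as z_1, ..., z_n) and is not taken from the paper. It shows only the textbook relationship between per-token entropies and the joint entropy, not the authors' own derivation.

```latex
% Assumed notation: z_1, ..., z_n are the patch tokens of one image.
% Chain rule: the joint entropy is a sum of conditional entropies.
H(z_1, \dots, z_n) = \sum_{i=1}^{n} H(z_i \mid z_1, \dots, z_{i-1})

% Conditioning never increases entropy, so coding each token
% independently costs at least as much as coding them jointly.
% The gap is the total correlation, which is zero only when the
% tokens are mutually independent.
\sum_{i=1}^{n} H(z_i) - H(z_1, \dots, z_n) \;\ge\; 0

% For n = 2 the gap reduces to the mutual information.
H(z_1) + H(z_2) - H(z_1, z_2) = I(z_1; z_2)
```

Under this reading, a nonzero gap is the inter-patch redundancy that a fixed-rate tokenizer pays for at every position.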
📝 Abstract
Accurate and effective discrete image tokenization is crucial for long image sequence processing. However, current methods rigidly compress all content at a fixed rate, ignoring the variable information density of images and leading to either redundancy or information loss. Inspired by information entropy, we propose TaTok, a Theoretically grounded adaptive image Tokenization framework. We rigorously identify two key drawbacks in existing methods: information insufficiency when reconstructing images with patch tokens alone, and information redundancy among patch tokens. To address these, we introduce global tokens that model mutual information across patch tokens, and a Dynamic Token Filtering (DTF) algorithm based on cumulative conditional entropy to eliminate redundancy. Experiments confirm TaTok's state-of-the-art performance, delivering a 1.3x gFID improvement and 8.7x inference speedup. By allocating tokens according to information richness, TaTok enables more compressed yet accurate image tokenization, offering valuable insights for future research.
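The abstract names Dynamic Token Filtering but does not give its exact criterion, so the following is only a minimal sketch of one plausible reading. It assumes a per-token conditional entropy estimate is already available and keeps the most informative tokens until their cumulative entropy covers a chosen fraction of the total. The function name, the `keep_fraction` parameter, and the greedy ranking are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def filter_tokens_by_cumulative_entropy(
    cond_entropies: Sequence[float],
    keep_fraction: float = 0.9,
) -> List[int]:
    """Return indices of the tokens to keep, in their original order.

    Hypothetical sketch: `cond_entropies[i]` is an assumed estimate of the
    conditional entropy of token i. Tokens are taken from most to least
    informative until the running sum reaches `keep_fraction` of the total.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")

    total = sum(cond_entropies)
    if total <= 0.0:
        return []  # nothing informative to keep

    budget = keep_fraction * total

    # Rank token indices by entropy, largest first.
    order = sorted(
        range(len(cond_entropies)),
        key=lambda i: cond_entropies[i],
        reverse=True,
    )

    kept: List[int] = []
    cumulative = 0.0
    for i in order:
        if cumulative >= budget:
            break  # budget already covered; drop the remaining tokens
        kept.append(i)
        cumulative += cond_entropies[i]

    # Restore spatial order so downstream code sees tokens left to right.
    return sorted(kept)


if __name__ == "__main__":
    entropies = [2.0, 0.1, 1.5, 0.05, 0.35]
    print(filter_tokens_by_cumulative_entropy(entropies, keep_fraction=0.85))
```

With the example entropies, two of the five tokens already cover 85% of the total, so a flat region of an image would retain few tokens while a detailed region would retain many. That is the adaptive allocation the abstract describes, although the actual DTF rule may differ.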
Problem

Research questions and friction points this paper is trying to address.

image tokenization
information redundancy
information insufficiency
variable information density
discrete representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive tokenization
global tokens
mutual information
dynamic token filtering
information entropy
Xiusheng Huang
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences; Beijing Academy of Artificial Intelligence
Xin Jiang
Beijing Academy of Artificial Intelligence
LLM, Multimodal Model, Embodied AI
Jun Zhao
School of Marine Sciences, Sun Yat-sen University
ocean optics, remote sensing, numerical modeling
Kang Liu
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences
Yequan Wang
Beijing Academy of Artificial Intelligence