🤖 AI Summary
This work addresses the limitations of existing image tokenization methods that employ fixed compression ratios, which ignore spatial variations in local information density and often result in either redundancy or information loss. The authors propose TaTok, an adaptive image tokenization framework grounded in information entropy theory. They theoretically demonstrate, for the first time, that relying solely on local image patches leads to insufficient information representation and inter-patch redundancy. To overcome this, TaTok introduces global tokens to model inter-patch mutual information and incorporates a Dynamic Token Filtering (DTF) strategy based on cumulative conditional entropy, enabling adaptive token allocation according to local information richness. Through a mutual enhancement mechanism between global and local tokens, TaTok achieves superior efficiency without compromising reconstruction quality, yielding a 1.3× improvement in gFID and an 8.7× inference speedup over baseline methods, thereby striking a better trade-off between compression ratio and reconstruction fidelity.
📝 Abstract
Accurate and effective discrete image tokenization is crucial for long image sequence processing. However, current methods rigidly compress all content at a fixed rate, ignoring the variable information density of images and leading to either redundancy or information loss. Inspired by information entropy, we propose TaTok, a Theoretically grounded adaptive image Tokenization framework. We rigorously identify two key drawbacks in existing methods: information insufficiency when reconstructing images with patch tokens alone, and information redundancy among patch tokens. To address these, we introduce global tokens that model mutual information across patch tokens, and a Dynamic Token Filtering (DTF) algorithm based on cumulative conditional entropy to eliminate redundancy. Experiments confirm TaTok's state-of-the-art performance, delivering a 1.3x gFID improvement and 8.7x inference speedup. By allocating tokens according to information richness, TaTok enables more compressed yet accurate image tokenization, offering valuable insights for future research.