CAT: Content-Adaptive Image Tokenization

📅 2025-01-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing image tokenization methods use a fixed number of tokens per image, ignoring variations in semantic complexity across images, which leads to suboptimal encoding efficiency and perceptual distortion. This paper proposes CAT (Content-Adaptive Tokenizer), presented as the first image tokenizer that dynamically allocates tokens based on semantic complexity: it leverages an LLM-driven caption assessment module to predict an optimal compression ratio per image, supports variable-length latent representations, and integrates seamlessly into the Diffusion Transformer (DiT) framework. On ImageNet generation, CAT achieves significantly lower FID than fixed-token baselines under equivalent computational budgets, improves inference throughput by 18.5%, and maintains robust reconstruction quality and strong generalization. Its core contribution is to model image semantic complexity explicitly as the principled basis for tokenization, jointly optimizing encoding efficiency, generative fidelity, and computational cost.

📝 Abstract
Most existing image tokenizers encode images into a fixed number of tokens or patches, overlooking the inherent variability in image complexity. To address this, we introduce the Content-Adaptive Tokenizer (CAT), which dynamically adjusts representation capacity based on image content and encodes simpler images into fewer tokens. We design a caption-based evaluation system that leverages large language models (LLMs) to predict content complexity and determine the optimal compression ratio for a given image, taking into account factors critical to human perception. Trained on images with diverse compression ratios, CAT demonstrates robust performance in image reconstruction. We also utilize its variable-length latent representations to train Diffusion Transformers (DiTs) for ImageNet generation. By optimizing token allocation, CAT improves the FID score over fixed-ratio baselines trained with the same FLOPs and boosts inference throughput by 18.5%.
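The adaptive allocation described above can be illustrated with a minimal sketch. This is not the authors' implementation: the thresholds, the set of compression ratios, and the `choose_compression_ratio` / `num_tokens` helpers are all assumptions for illustration, and the complexity score (which CAT derives from an LLM-based caption assessment) is supplied as a plain number here.

```python
# Hypothetical sketch of content-adaptive token allocation (not the paper's code).
# A complexity score in [0, 1] — in CAT, predicted by an LLM from image captions,
# but mocked here — is mapped to a spatial downsampling ratio, so that simpler
# images are encoded into fewer latent tokens.

def choose_compression_ratio(complexity: float) -> int:
    """Map a complexity score to a downsampling factor (thresholds are assumed)."""
    if complexity < 0.33:
        return 32   # simple image: aggressive compression
    elif complexity < 0.66:
        return 16   # moderate complexity
    return 8        # complex image: keep more tokens

def num_tokens(image_size: int, ratio: int) -> int:
    """Number of latent tokens for a square image at the given ratio."""
    side = image_size // ratio
    return side * side

# A 256x256 image yields 64, 256, or 1024 tokens depending on complexity.
for score in (0.2, 0.5, 0.9):
    r = choose_compression_ratio(score)
    print(f"complexity={score}: ratio={r}, tokens={num_tokens(256, r)}")
```

The variable-length outputs are what the downstream DiT consumes; under a fixed compute budget, tokens saved on simple images can be spent on complex ones, which is the source of the reported FID and throughput gains.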
Problem

Research questions and friction points this paper is trying to address.

Image Processing
Content-aware Partitioning
Complexity-aware Image Analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Content-Adaptive Tokenizer
Diffusion Transformers
Image Processing Efficiency