Language-Guided Image Tokenization for Generation

📅 2024-12-08
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
Mainstream image tokenizers have limited compression rates, making high-resolution generative modeling computationally expensive. To address this, the paper proposes Text-Conditioned Image Tokenization (TexTok), a framework that conditions the tokenization process on descriptive text captions. Because the captions supply a compact, high-level semantic representation, the tokenizer can devote its learning capacity and token budget to fine-grained visual detail, improving reconstruction quality at higher compression rates. Compared with a tokenizer without text conditioning, TexTok lowers average reconstruction FID by 29.2% on ImageNet-256 and 48.1% on ImageNet-512, which translates into 16.3% and 34.3% average generation FID improvements. With a vanilla Diffusion Transformer (DiT) generator it reaches state-of-the-art FID scores of 1.46 and 1.62 on ImageNet-256 and -512, and with only 32 tokens it achieves a 93.5× inference speedup on ImageNet-512 while still outperforming the original DiT. TexTok also supports text-to-image generation using off-the-shelf captions.
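The reported 93.5× speedup is a measured end-to-end number from the paper; a hedged back-of-the-envelope sketch can show *why* fewer tokens help, since self-attention cost in a transformer generator grows roughly quadratically with sequence length. The baseline token count and width below are illustrative assumptions, not the paper's exact configuration.

```python
# Back-of-the-envelope sketch (illustrative numbers, not the paper's exact
# setup): self-attention FLOPs scale ~quadratically with token count, so a
# tokenizer that emits fewer tokens cuts generator compute sharply.
def attention_flops(n_tokens: int, dim: int) -> int:
    """Rough FLOPs for one self-attention layer: the QK^T and AV matmuls."""
    return 2 * n_tokens * n_tokens * dim

# Assumed baseline: a patch tokenizer at 512x512 with 16x16 patches -> 1024
# tokens. TexTok's reported compact setting uses 32 tokens.
baseline = attention_flops(1024, 768)
textok = attention_flops(32, 768)
print(f"per-layer attention FLOPs ratio: {baseline / textok:.0f}x")  # 1024x
```

The theoretical per-layer ratio (1024×) differs from the measured 93.5× wall-clock speedup, as expected: real inference cost also includes MLP blocks, which scale linearly with token count, plus fixed overheads.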

📝 Abstract
Image tokenization, the process of transforming raw image pixels into a compact low-dimensional latent representation, has proven crucial for scalable and efficient image generation. However, mainstream image tokenization methods generally have limited compression rates, making high-resolution image generation computationally expensive. To address this challenge, we propose to leverage language for efficient image tokenization, and we call our method Text-Conditioned Image Tokenization (TexTok). TexTok is a simple yet effective tokenization framework that leverages language to provide a compact, high-level semantic representation. By conditioning the tokenization process on descriptive text captions, TexTok simplifies semantic learning, allowing more learning capacity and token space to be allocated to capture fine-grained visual details, leading to enhanced reconstruction quality and higher compression rates. Compared to the conventional tokenizer without text conditioning, TexTok achieves average reconstruction FID improvements of 29.2% and 48.1% on ImageNet-256 and -512 benchmarks respectively, across varying numbers of tokens. These tokenization improvements consistently translate to 16.3% and 34.3% average improvements in generation FID. By simply replacing the tokenizer in Diffusion Transformer (DiT) with TexTok, our system can achieve a 93.5x inference speedup while still outperforming the original DiT using only 32 tokens on ImageNet-512. TexTok with a vanilla DiT generator achieves state-of-the-art FID scores of 1.46 and 1.62 on ImageNet-256 and -512 respectively. Furthermore, we demonstrate TexTok's superiority on the text-to-image generation task, effectively utilizing the off-the-shelf text captions in tokenization. Project page is at: https://kaiwenzha.github.io/textok/.
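The abstract's core mechanism is that both the tokenizer and the detokenizer see the caption, so the small set of learned image tokens need not re-encode semantics the text already carries. The following shape-level NumPy sketch illustrates that data flow under assumed dimensions; the cross-attention stand-ins and all names (`queries`, `caption`, token counts) are hypothetical, not the paper's architecture.

```python
import numpy as np

# Hypothetical shape-level sketch of text-conditioned tokenization.
# Dimensions are illustrative assumptions, not the paper's configuration.
rng = np.random.default_rng(0)
d = 64           # shared embedding width (assumed)
n_patches = 256  # e.g. a 256x256 image with 16x16 patches
n_text = 77      # caption embedding length (assumed, CLIP-style)
n_tokens = 32    # compact learned image tokens

patches = rng.normal(size=(n_patches, d))  # image patch embeddings
caption = rng.normal(size=(n_text, d))     # text caption embeddings
queries = rng.normal(size=(n_tokens, d))   # learned latent token queries

# Encoder stand-in: latent tokens cross-attend over [patches; caption],
# so the caption supplies semantics during tokenization.
ctx = np.concatenate([patches, caption], axis=0)         # (333, d)
attn = np.exp(queries @ ctx.T / np.sqrt(d))
attn /= attn.sum(axis=1, keepdims=True)
image_tokens = attn @ ctx                                # (32, d)

# Decoder stand-in: patch queries attend over [image_tokens; caption],
# letting the 32 tokens spend capacity on fine visual detail.
dec_q = rng.normal(size=(n_patches, d))
dec_ctx = np.concatenate([image_tokens, caption], axis=0)
dec_attn = np.exp(dec_q @ dec_ctx.T / np.sqrt(d))
dec_attn /= dec_attn.sum(axis=1, keepdims=True)
recon = dec_attn @ dec_ctx                               # (256, d)
print(image_tokens.shape, recon.shape)  # (32, 64) (256, 64)
```

The design point this illustrates: because the caption is available on both sides, it never needs to be compressed into the 32 image tokens, which is what permits the high compression rate.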
Problem

Research questions and friction points this paper is trying to address.

Improving image tokenization compression rates for high-resolution generation
Leveraging language for compact semantic image representation
Enhancing reconstruction and generation quality with text-conditioned tokenization
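To make the compression question above concrete, a hedged calculation shows how many pixel values each latent value must summarize at 512×512 under an assumed token width (the channel dimension below is an illustrative guess, not the paper's):

```python
# Illustrative compression arithmetic (token width is an assumption).
pixels = 512 * 512 * 3          # raw pixel values in a 512x512 RGB image
n_tokens, token_dim = 32, 8     # 32 latent tokens, assumed 8 channels each
latent_vals = n_tokens * token_dim
print(pixels // latent_vals)    # pixel values per latent value -> 3072
```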
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-guided image tokenization for higher compression rates
Text conditioning offloads semantic modeling to captions, freeing token capacity for fine visual detail
93.5× inference speedup with only 32 tokens while outperforming the original DiT on ImageNet-512