Holistic Tokenizer for Autoregressive Image Generation

📅 2025-07-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Traditional autoregressive image generation models suffer from limited global structural modeling because of local patch-based tokenization and per-token sequential generation. To address this, we propose Hita, a holistic-to-local autoregressive image tokenizer. Hita introduces learnable global queries that interact with local patch tokens, uses a causal attention mechanism and a lightweight fusion module to explicitly encode global semantics (including texture, material, and shape), and employs quantization/de-quantization for efficient training. Evaluated on ImageNet, Hita achieves a 2.59 FID and a 281.9 Inception Score (IS) with significantly accelerated training. It also demonstrates strong generalization in zero-shot style transfer and image inpainting. Our core contribution is the first integration of holistic perception into an autoregressive tokenization architecture, unifying global coherence and local detail modeling within a single framework.

📝 Abstract
The vanilla autoregressive image generation model generates visual tokens step by step, which limits its ability to capture holistic relationships among token sequences. Moreover, most visual tokenizers map local image patches into latent tokens, leading to limited global information. To address this, we introduce Hita, a novel image tokenizer for autoregressive (AR) image generation. It introduces a holistic-to-local tokenization scheme with learnable holistic queries and local patch tokens. In addition, Hita incorporates two key strategies for improved alignment with the AR generation process: 1) it arranges a sequential structure with holistic tokens at the beginning, followed by patch-level tokens, while using causal attention to maintain awareness of previous tokens; and 2) before feeding the de-quantized tokens into the decoder, Hita adopts a lightweight fusion module that controls the information flow to prioritize holistic tokens. Extensive experiments show that Hita accelerates the training of AR generators and outperforms those trained with vanilla tokenizers, achieving 2.59 FID and 281.9 IS on the ImageNet benchmark. A detailed analysis of the holistic representation highlights its ability to capture global image properties such as textures, materials, and shapes. Additionally, Hita demonstrates effectiveness in zero-shot style transfer and image inpainting. The code is available at https://github.com/CVMI-Lab/Hita.
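As a minimal illustration of strategy 1), the holistic-first ordering and its causal mask can be sketched in plain Python. Token names and counts here are illustrative assumptions, not the paper's actual configuration:

```python
# Hypothetical sketch of Hita's holistic-to-local token ordering and causal
# attention mask. All names and sizes are illustrative, not the paper's code.

def arrange_tokens(holistic, patches):
    """Place holistic query tokens before local patch tokens."""
    return holistic + patches

def causal_mask(n):
    """Lower-triangular mask: position i may attend only to positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]

holistic = [f"h{i}" for i in range(4)]   # e.g. 4 learnable holistic queries
patches  = [f"p{i}" for i in range(8)]   # e.g. 8 local patch tokens

seq = arrange_tokens(holistic, patches)
mask = causal_mask(len(seq))

# Because holistic tokens come first, the very first patch token (index 4)
# already attends to every holistic token, but to no later patch token.
print(seq[:5])
print(mask[4])
```

The point of the ordering is visible in `mask[4]`: local generation is conditioned on the full set of holistic tokens from the start.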
Problem

Research questions and friction points this paper is trying to address.

Step-by-step token generation limits modeling of holistic relationships
Patch-based tokenizers capture limited global information
Token ordering and feature fusion are poorly aligned with AR generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Holistic-to-local tokenization with learnable queries
Sequential structure with causal attention
Lightweight fusion module for information control
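The lightweight fusion module can be caricatured as a gate that blends the de-quantized holistic and patch feature streams while favouring the holistic one. This is a hedged sketch under stated assumptions: the fixed gate value and the element-wise blend are stand-ins, since the paper's module is learned:

```python
# Hypothetical sketch of a fusion gate that prioritizes holistic information
# before decoding. The fixed gate value is an assumption; Hita's actual
# fusion module is a learned, lightweight network.

def fuse(holistic_feat, patch_feat, gate=0.7):
    """Blend holistic and patch features element-wise; gate > 0.5
    weights the holistic stream more heavily."""
    return [gate * h + (1 - gate) * p
            for h, p in zip(holistic_feat, patch_feat)]

fused = fuse([1.0, 2.0], [3.0, 4.0])
print(fused)  # each output sits closer to the holistic value than the patch value
```

With `gate=0.7`, each fused feature lands 70% of the way toward the holistic stream, which mirrors the stated goal of letting global semantics dominate the decoder's input.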
Anlin Zheng
The University of Hong Kong
Haochen Wang
NLPR, MAIS, CASIA
Yucheng Zhao
MEGVII Technology
Robot, Large Language Model, Video Generation
Weipeng Deng
The University of Hong Kong
Tiancai Wang
Dexmal
Computer Vision, Embodied AI
Xiangyu Zhang
StepFun, MEGVII Technology
Xiaojuan Qi
Assistant Professor, The University of Hong Kong
3D Vision, Deep Learning, Artificial Intelligence, Medical Image Analysis