SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation

📅 2025-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing unified image tokenizers struggle to simultaneously achieve high-level semantic reconstruction and low-level pixel fidelity, leading to a performance trade-off in multimodal understanding and generation tasks. This paper proposes SemHiTok, a semantic-guided hierarchical codebook tokenizer that introduces a novel “semantic-prior + texture sub-codebook” architecture to decouple semantic and pixel-level representation learning. It constructs a hierarchical codebook via pre-trained semantic embeddings and jointly optimizes semantic reconstruction loss and pixel reconstruction loss in a staged manner, yielding consistent discrete representations. Evaluated at 256×256 resolution, SemHiTok achieves state-of-the-art rFID (3.12) and delivers competitive performance across multimodal understanding tasks (e.g., VQA, image captioning) and generative tasks. Notably, it is the first approach to significantly improve texture modeling accuracy without compromising semantic capability.

📝 Abstract
We present SemHiTok, a unified image tokenizer built on a Semantic-Guided Hierarchical codebook that provides consistent discrete feature representations for multimodal understanding and generation tasks. Recently, unified multimodal large language models (MLLMs) for understanding and generation have sparked exploration within the research community. Previous works attempt to train a unified image tokenizer by combining loss functions for semantic feature reconstruction and pixel reconstruction. However, because multimodal understanding and generation tasks prioritize different levels of features, joint training methods face significant challenges in achieving a good trade-off. SemHiTok addresses this challenge through a Semantic-Guided Hierarchical codebook, which builds texture sub-codebooks on top of a pre-trained semantic codebook. This design decouples the training of semantic reconstruction from pixel reconstruction and equips the tokenizer with low-level texture feature extraction capability without degrading its high-level semantic feature extraction ability. Our experiments demonstrate that SemHiTok achieves a state-of-the-art rFID score at 256×256 resolution compared with other unified tokenizers, and exhibits competitive performance on multimodal understanding and generation tasks.
Problem

Research questions and friction points this paper is trying to address.

Unified image tokenizer for multimodal tasks
Semantic-guided hierarchical codebook design
Balancing semantic and pixel reconstruction trade-offs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic-Guided Hierarchical codebook for image tokenization
Decouples semantic and pixel reconstruction training
Enhances texture and semantic feature extraction capabilities
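The two-stage lookup described above — a pre-trained semantic codebook followed by texture sub-codebooks conditioned on the chosen semantic code — can be sketched as a minimal quantizer. This is an illustrative reconstruction, not the paper's implementation: the codebook sizes, the residual-quantization step, and all names here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): 8 semantic codes, each with
# its own 4-entry texture sub-codebook, over 16-dim features.
N_SEM, N_TEX, DIM = 8, 4, 16

# In SemHiTok the semantic codebook is pre-trained and the texture
# sub-codebooks are trained afterwards; here both are random stand-ins.
semantic_codebook = rng.normal(size=(N_SEM, DIM))
texture_subcodebooks = rng.normal(size=(N_SEM, N_TEX, DIM))

def quantize(feature):
    """Hierarchical lookup: pick the nearest semantic code first, then
    quantize the residual with that code's own texture sub-codebook."""
    # Stage 1: nearest entry in the (frozen) semantic codebook.
    sem_idx = int(np.argmin(np.linalg.norm(semantic_codebook - feature, axis=1)))
    # Stage 2: quantize what the semantic code failed to capture
    # (an assumed residual formulation) with the conditioned sub-codebook.
    residual = feature - semantic_codebook[sem_idx]
    sub = texture_subcodebooks[sem_idx]
    tex_idx = int(np.argmin(np.linalg.norm(sub - residual, axis=1)))
    # The discrete token is the pair (sem_idx, tex_idx); the decoder
    # sees the combined embedding.
    return sem_idx, tex_idx, semantic_codebook[sem_idx] + sub[tex_idx]

sem_idx, tex_idx, recon = quantize(rng.normal(size=DIM))
```

Because the semantic index is chosen first and never revised by the texture stage, pixel-level training cannot perturb the semantic assignment — which is the decoupling the Innovation bullets describe.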
Zisheng Chen
Sun Yat-sen University

Chunwei Wang
Researcher, Huawei Noah's Ark Lab
Computer Vision · Autonomous Driving · Multimodality

Xiuwei Chen
Sun Yat-sen University

Hang Xu
Huawei Noah's Ark Lab

Jianhua Han
2030 Research, YinWang, Huawei
Vision Language Model · Foundation Model · VLA

Xiandan Liang
Sun Yat-sen University