SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation

📅 2025-03-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing unified image tokenizers struggle to achieve high-level semantic reconstruction and low-level pixel fidelity at the same time, which forces a performance trade-off between multimodal understanding and generation tasks. This paper proposes SemHiTok, a semantic-guided hierarchical codebook tokenizer that uses a “semantic-prior + texture sub-codebook” architecture to decouple semantic and pixel-level representation learning. It builds the hierarchical codebook on top of a pre-trained semantic codebook and optimizes the semantic reconstruction loss and the pixel reconstruction loss in separate stages rather than jointly, yielding consistent discrete representations. Evaluated at 256×256 resolution, SemHiTok achieves state-of-the-art rFID (3.12) among unified tokenizers and delivers competitive performance on multimodal understanding tasks (e.g., VQA, image captioning) and generative tasks. Its design adds texture modeling capacity without degrading semantic capability.

📝 Abstract

We present SemHiTok, a unified image Tokenizer via Semantic-Guided Hierarchical codebook that provides consistent discrete feature representations for multimodal understanding and generation tasks. Recently, unified multimodal large language models (MLLMs) for understanding and generation have sparked exploration within the research community. Previous works attempt to train a unified image tokenizer by combining loss functions for semantic feature reconstruction and pixel reconstruction. However, because multimodal understanding and generation tasks prioritize different levels of features, joint training methods struggle to achieve a good trade-off. SemHiTok addresses this challenge through a Semantic-Guided Hierarchical codebook, which builds texture sub-codebooks on a pre-trained semantic codebook. This design decouples the training of semantic reconstruction from that of pixel reconstruction, and it equips the tokenizer with low-level texture feature extraction without degrading its high-level semantic feature extraction. Our experiments demonstrate that SemHiTok achieves a state-of-the-art rFID score at 256×256 resolution compared to other unified tokenizers, and exhibits competitive performance on multimodal understanding and generation tasks.

Problem

Research questions and friction points this paper is trying to address.

Unified image tokenizer for multimodal tasks

Semantic-guided hierarchical codebook design

Balancing semantic and pixel reconstruction trade-offs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic-Guided Hierarchical codebook for image tokenization

Decouples semantic and pixel reconstruction training

Enhances texture and semantic feature extraction capabilities
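
The hierarchical codebook idea listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy codebooks, the use of squared L2 distance, the assumption that every texture sub-codebook has the same size, and the way the two indices are flattened into one token id are all assumptions made for illustration. The only part taken from the abstract is the structure itself: a semantic codebook is consulted first, and the selected semantic code determines which texture sub-codebook is searched.

```python
def nearest(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, entry in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, entry))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def hierarchical_quantize(sem_feat, tex_feat, sem_codebook, tex_subcodebooks):
    """Two-level lookup: semantic code first, then its texture sub-codebook.

    Hypothetical helper. Assumes tex_subcodebooks[i] is the sub-codebook
    attached to semantic code i and that all sub-codebooks share one size.
    """
    # Level 1: pick the semantic code (the codebook is treated as frozen).
    sem_id = nearest(sem_feat, sem_codebook)
    # Level 2: search only the sub-codebook tied to that semantic code.
    sub = tex_subcodebooks[sem_id]
    tex_id = nearest(tex_feat, sub)
    # Assumed flattening of the pair into a single discrete token id.
    token_id = sem_id * len(sub) + tex_id
    return sem_id, tex_id, token_id


# Toy data: 2 semantic codes, each with a 3-entry texture sub-codebook.
sem_codebook = [[0.0, 0.0], [1.0, 1.0]]
tex_subcodebooks = [
    [[0.0], [1.0], [2.0]],
    [[10.0], [11.0], [12.0]],
]

result = hierarchical_quantize([0.9, 1.2], [11.2], sem_codebook, tex_subcodebooks)
print(result)  # (1, 1, 4)
```

Because the semantic index is chosen before any texture entry is examined, training or changing the texture sub-codebooks in this sketch cannot alter which semantic code a feature maps to, which is the sense in which the two levels are decoupled. Under the assumed flattening, `divmod(token_id, 3)` recovers the `(sem_id, tex_id)` pair.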

🔎 Similar Papers

Towards Semantic Equivalence of Tokenization in Multimodal LLM