UniCode$^2$: Cascaded Large-scale Codebooks for Unified Multimodal Understanding and Generation

📅 2025-06-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing codebook-based multimodal large language models (MLLMs) suffer from coarse semantic granularity and limited expressivity due to small codebooks (~16K entries); naively scaling up codebook size degrades token utilization and destabilizes training. Method: We propose UniCode$^2$, the first framework to construct a large-scale (500K-entry), semantically aligned visual codebook. It employs a cascaded architecture comprising a frozen anchor codebook and a trainable task-specific codebook, ensuring training stability and high token efficiency. The codebook is built by clustering SigLIP sequence embeddings, and the framework seamlessly integrates autoregressive modeling with a diffusion-based decoder. Contribution/Results: UniCode$^2$ achieves state-of-the-art performance across diverse multimodal understanding and generation benchmarks. Experiments validate the feasibility and effectiveness of large-scale, semantically grounded visual tokenization—enabling high-capacity, discrete visual representation without compromising training dynamics or inference efficiency.

📝 Abstract
Unified multimodal large language models (MLLMs) have shown promise in jointly advancing multimodal understanding and generation, with visual codebooks discretizing images into tokens for autoregressive modeling. Existing codebook-based methods either rely on small vocabularies (~16K entries) that lack fine-grained semantics or naively scale up, resulting in low token utilization and unstable training. We propose UniCode$^2$, a cascaded codebook framework enabling large-scale, semantically aligned, and stable visual tokenization. By clustering millions of SigLIP sequence embeddings, we build a 500K-entry codebook that preserves vision-language alignment while expanding capacity. Stability is ensured via a cascaded design: a frozen codebook anchors the embedding space, and a trainable codebook refines task-specific semantics. This decoupling promotes high utilization and robust learning. Moreover, the alignment of our visual tokens with textual semantics enables seamless integration with pretrained diffusion decoders, supporting high-quality visual synthesis with minimal adaptation. UniCode$^2$ delivers strong performance across diverse benchmarks, demonstrating the viability of scaling visual token spaces without sacrificing stability, semantics, or modularity.
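The abstract's codebook construction step, clustering millions of SigLIP sequence embeddings into codebook entries, amounts to running k-means and keeping the centroids. The sketch below uses plain Lloyd's k-means on random stand-in vectors; the scale, data, and clustering variant are assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for SigLIP sequence embeddings; the paper clusters millions
# of them into 500K entries. We use a toy scale to stay runnable.
embeddings = rng.standard_normal((1000, 8))
k = 32  # hypothetical codebook size

# Lloyd's k-means: the final centroids serve as codebook entries.
centroids = embeddings[rng.choice(len(embeddings), k, replace=False)].copy()
for _ in range(10):
    # assign each embedding to its nearest centroid
    dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    # move each centroid to the mean of its assigned embeddings
    for c in range(k):
        members = embeddings[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)

codebook = centroids  # an image token is the index of its nearest entry
```

Because the entries are means of semantically aligned SigLIP embeddings rather than freely learned vectors, the resulting tokens stay close to the text embedding space, which is what lets a pretrained diffusion decoder consume them with minimal adaptation.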
Problem

Research questions and friction points this paper is trying to address.

Existing codebooks either lack fine-grained semantics or train unstably at scale
Scaling visual token spaces sacrifices stability and semantics
Integrating visual tokens with textual semantics is challenging
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cascaded codebook framework for stable tokenization
Clustering SigLIP embeddings for large-scale semantics
Alignment with textual semantics enables diffusion integration