InfoTok: Regulating Information Flow for Capacity-Constrained Shared Visual Tokenization in Unified MLLMs

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of principled design criteria for shared visual tokens in existing unified multimodal large language models (MLLMs), which struggle to balance image understanding and generation under limited representational capacity. To this end, the authors introduce InfoTok, the first framework that leverages information bottleneck theory to guide the design of shared visual tokens. By applying mutual information regularization, InfoTok modulates the flow of information from images to tokens, effectively balancing redundancy compression against the preservation of task-relevant content—prioritizing reusable structural features while suppressing high-entropy redundancies. Notably, InfoTok requires no additional training data and consistently enhances both image understanding and generation performance across three mainstream unified MLLMs.

📝 Abstract
Unified multimodal large language models (MLLMs) integrate image understanding and generation in a single framework, with the visual tokenizer acting as the sole interface that maps visual inputs into tokens for downstream tasks. However, existing shared-token designs are mostly architecture-driven and lack an explicit criterion for what information tokens should preserve to support both understanding and generation. Therefore, we introduce a capacity-constrained perspective, highlighting that in shared-token unified MLLMs the visual tokenizer behaves as a compute-bounded learner, so the token budget should prioritize reusable structure over hard-to-exploit high-entropy variations and redundancy. Motivated by this perspective, we propose InfoTok, an information-regularized visual tokenization mechanism grounded in the Information Bottleneck (IB) principle. InfoTok formulates tokenization as controlling information flow from images to shared tokens to multimodal outputs, yielding a principled trade-off between compression and task relevance via mutual-information regularization. We integrate InfoTok into three representative unified MLLMs without introducing any additional training data. Experiments show consistent improvements on both understanding and generation, supporting information-regularized tokenization as a principled foundation for learning a shared token space in unified MLLMs.
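The compression-versus-relevance trade-off described above can be illustrated with a standard variational Information Bottleneck penalty: a KL term upper-bounds the compression side I(X;Z), weighted against a task loss that stands in for preserving I(Z;Y). The Gaussian encoder parameterization (`mu`, `log_var`), the `beta` value, and all function names below are illustrative assumptions for a minimal sketch, not InfoTok's actual implementation.

```python
import numpy as np

def vib_regularizer(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ).

    This closed-form KL is the usual variational upper bound on the
    compression term I(X; Z) when the token encoder emits a diagonal
    Gaussian (an assumption of this sketch, not of the paper).
    """
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def ib_loss(task_loss, mu, log_var, beta=1e-3):
    """IB-style objective: task relevance (proxy for maximizing I(Z; Y))
    plus a beta-weighted compression penalty (bound on I(X; Z)).

    Larger beta squeezes more high-entropy detail out of the tokens;
    smaller beta preserves more of the input at the cost of redundancy.
    """
    return task_loss + beta * np.mean(vib_regularizer(mu, log_var))
```

With `mu = 0` and `log_var = 0` the encoder already matches the prior, so the penalty vanishes and the objective reduces to the task loss alone; increasing `beta` then only matters once the encoder deviates from the prior to carry information.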
Problem

Research questions and friction points this paper is trying to address.

visual tokenization
unified MLLMs
information bottleneck
shared tokens
multimodal understanding and generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Information Bottleneck
visual tokenization
unified MLLMs
information regularization
shared token space