🤖 AI Summary
This work addresses the mismatch between the continuous, highly correlated image embeddings produced by existing visual projectors and the discrete, semantically distinct tokens that language models are trained on. To bridge this gap, the authors propose Disentangled Visual Tokenization (DiVT), which clusters patch-level image embeddings into semantically coherent visual tokens and adapts the number of tokens to image complexity. DiVT produces these variable-length token sequences without modifying either the visual encoder or the language model, making visual inputs more compatible with the LLM. Experiments show that DiVT matches or surpasses baselines across multiple benchmarks while using significantly fewer visual tokens, reducing memory consumption and inference latency.
📝 Abstract
Modern multimodal large language models (MLLMs) typically keep the language model fixed and train a visual projector that maps image pixels into a sequence of tokens in the language model's embedding space, so that images can be presented in essentially the same form as text. However, the language model has been optimized to operate on discrete, semantically meaningful tokens, while prevailing visual projectors transform an image into a long stream of continuous and highly correlated embeddings. As a result, visual tokens behave differently from the word-like units that LLMs were originally trained to understand. We propose Disentangled Visual Tokenization (DiVT), a novel method that clusters patch embeddings into coherent semantic units, so that each token corresponds to a distinct visual concept instead of a rigid grid cell. DiVT further adapts its token budget to image complexity, providing an explicit accuracy-compute trade-off without modifying either the vision encoder or the language model. Across diverse multimodal benchmarks, DiVT matches or surpasses baselines with significantly fewer visual tokens and remains robust under limited token budgets, reducing memory cost and latency while making visual inputs more compatible with LLMs. Our code is available at https://github.com/snuviplab/DiVT.
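To make the idea of content-adaptive token counts concrete, here is a minimal sketch that groups patch embeddings by cosine similarity and emits one averaged token per group. It is not the DiVT algorithm: the abstract does not specify the clustering procedure, so the greedy assignment rule, the function name `cluster_patches`, and the `threshold` parameter are all assumptions made for illustration.

```python
import numpy as np


def cluster_patches(patches: np.ndarray, threshold: float = 0.8) -> np.ndarray:
    """Greedily group patch embeddings and return one mean embedding per group.

    Illustrative stand-in only (hypothetical name and parameters), not the
    clustering used by DiVT. `patches` has shape (num_patches, dim).
    """
    # Unit-normalize so that dot products are cosine similarities.
    norms = np.linalg.norm(patches, axis=1, keepdims=True)
    unit = patches / np.clip(norms, 1e-12, None)

    sums = []     # running sum of unit vectors for each cluster
    members = []  # patch indices assigned to each cluster
    for i, v in enumerate(unit):
        if sums:
            centers = np.stack(sums)
            center_norms = np.linalg.norm(centers, axis=1, keepdims=True)
            centers = centers / np.clip(center_norms, 1e-12, None)
            sims = centers @ v
            best = int(np.argmax(sims))
            if sims[best] >= threshold:
                # Similar enough to an existing cluster: merge into it.
                sums[best] = sums[best] + v
                members[best].append(i)
                continue
        # No sufficiently similar cluster: start a new one.
        sums.append(v.copy())
        members.append([i])

    # One output token per cluster: the mean of its original embeddings.
    return np.stack([patches[idx].mean(axis=0) for idx in members])


# A uniform image collapses to a single token; a two-region image yields two.
uniform = np.tile(np.array([[1.0, 0.0, 0.0]]), (16, 1))
two_regions = np.vstack(
    [uniform[:8], np.tile(np.array([[0.0, 1.0, 0.0]]), (8, 1))]
)
print(cluster_patches(uniform).shape)
print(cluster_patches(two_regions).shape)
```

In this sketch the output length depends on how many dissimilar regions the input contains, and lowering `threshold` merges more patches into fewer tokens, which is one simple way an accuracy-compute trade-off of the kind described above could be exposed.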