UniCompress: Token Compression for Unified Vision-Language Understanding and Generation

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational and memory overhead of unified vision-language models caused by processing a large number of image tokens, which hinders their deployment in resource-constrained settings. To this end, the authors propose a plug-and-play visual token compression and decompression mechanism that leverages learnable global meta-tokens to jointly support both vision understanding and generation tasks within a unified autoregressive framework. This approach achieves efficient token compression for the first time in a unified architecture, requiring only lightweight modules that can be seamlessly integrated into existing models without full retraining. Experimental results demonstrate a four-fold reduction in image token count, substantially lowering inference latency and training costs while incurring only minimal performance degradation.

📝 Abstract
Unified models aim to support both understanding and generation by encoding images into discrete tokens and processing them alongside text within a single autoregressive framework. This unified design offers architectural simplicity and cross-modal synergy, facilitating shared parameterization, consistent training objectives, and seamless transfer between modalities. However, the large number of visual tokens such models require introduces substantial computation and memory overhead, and this inefficiency directly hinders deployment in resource-constrained scenarios such as embodied AI systems. In this work, we propose UniCompress, a unified token compression algorithm that significantly reduces the visual token count while preserving performance on both image understanding and generation tasks. Our method introduces a plug-in compression and decompression mechanism guided by learnable global meta-tokens. The framework is lightweight and modular, enabling efficient integration into existing models without full retraining. Experimental results show that our approach reduces image tokens by up to 4x, substantially lowers inference latency and training cost, and incurs only minimal performance degradation, demonstrating the promise of token-efficient unified modeling for real-world multimodal applications.
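The abstract does not spell out the compressor's internals, but a common way to realize a meta-token compressor of this kind is cross-attention: M learnable meta-tokens query the N image tokens and produce M summary tokens (here 256 → 64, matching the paper's 4x reduction). The sketch below is a rough, hypothetical illustration in NumPy, not the authors' implementation; all names, shapes, and the identity projection weights are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_tokens(image_tokens, meta_tokens, Wq, Wk, Wv):
    """Cross-attention compression: each meta-token attends over all
    image tokens, yielding one compressed token per meta-token."""
    q = meta_tokens @ Wq        # (M, d) queries from learnable meta-tokens
    k = image_tokens @ Wk       # (N, d) keys from image tokens
    v = image_tokens @ Wv       # (N, d) values from image tokens
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (M, N) weights
    return attn @ v             # (M, d) compressed tokens

rng = np.random.default_rng(0)
d = 64                          # hypothetical token dimension
N, M = 256, 64                  # 256 image tokens -> 64 meta-tokens (4x)
image_tokens = rng.normal(size=(N, d))
meta_tokens = rng.normal(size=(M, d))   # learnable in a real model
Wq = Wk = Wv = np.eye(d)        # placeholder projections (learned in practice)

compressed = compress_tokens(image_tokens, meta_tokens, Wq, Wk, Wv)
print(compressed.shape)  # (64, 64)
```

Decompression for generation would run the same mechanism in reverse (image-position queries attending over the compressed tokens), which is what lets a single plug-in pair serve both understanding and generation without retraining the backbone.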
Problem

Research questions and friction points this paper is trying to address.

token compression
unified vision-language models
computational efficiency
visual tokens
resource-constrained deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

token compression
unified vision-language modeling
meta tokens
efficient multimodal inference
modular compression