G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models

📅 2026-05-12
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work targets two problems: the high inference cost of separate-encoder unified multimodal models, which stems from processing dense visual tokens, and the failure of existing token compression methods to preserve image editing capability. The authors propose a generation-guided visual token compression framework that applies a task-agnostic signal from the generation branch after the understanding encoder. By measuring consistency with the VAE latent space, the method identifies tokens that matter for both semantic understanding and image reconstruction. It then compresses the token sequence through balanced token selection and by merging redundant tokens into retained representatives. The approach is training-free and plug-and-play, and is described as the first use of generative signals for token compression on the understanding side. It reduces visual tokens and prefill computation by 1.94× while maintaining reasoning accuracy and editing quality, outperforming baselines on almost all benchmarks.
📝 Abstract
The development of separate-encoder unified multimodal models (UMMs) comes with a rapidly growing inference cost due to dense visual token processing. In this paper, we focus on understanding-side visual token reduction for improving the efficiency of separate-encoder UMMs. While this topic has been widely studied for MLLMs, existing methods typically rely on signals such as attention scores and text-image similarity, implicitly assuming that the final objective is discriminative reasoning. This assumption does not hold for UMMs, where understanding-side visual tokens must also preserve the model's image editing capability. We propose G$^2$TR, a generation-guided visual token reduction framework for separate-encoder UMMs. Our key insight is that the generation branch provides a task-agnostic signal for identifying understanding-side visual tokens that are not only semantically relevant but also important for latent-space image reconstruction and generation. G$^2$TR estimates token importance from consistency with the VAE latent, performs balanced token selection, and merges redundant tokens into retained representatives to reduce information loss. The method is training-free, plug-and-play, and applied only after the understanding encoding stage, making it compatible with existing UMM inference pipelines. Experiments on image understanding and editing benchmarks show that G$^2$TR reduces visual tokens and prefill computation by 1.94× while maintaining both reasoning accuracy and editing quality, outperforming baselines on almost all benchmarks.
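The three stages named in the abstract (importance estimation, balanced selection, merging) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and every concrete choice in it is an assumption made for illustration: the function names, the assumption that understanding tokens and VAE latents lie on the same spatial grid, the use of agreement between pairwise-similarity rows as the "consistency" score, the per-cell top-k rule as "balanced" selection, and mean pooling as the merge rule.

```python
import numpy as np


def _normalize(x, eps=1e-8):
    """L2-normalize along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def importance_scores(und_tokens, vae_latents):
    """Hypothetical consistency score (an assumption, not the paper's formula).

    For each token, compare its row of pairwise cosine similarities in the
    understanding space with the matching row in the VAE latent space.
    """
    sim_u = _normalize(und_tokens) @ _normalize(und_tokens).T  # (N, N)
    sim_z = _normalize(vae_latents) @ _normalize(vae_latents).T  # (N, N)
    return np.sum(_normalize(sim_u) * _normalize(sim_z), axis=-1)  # (N,)


def balanced_select(scores, grid_hw, cell, keep_per_cell):
    """Keep the top-scoring tokens inside each cell x cell spatial block."""
    h, w = grid_hw
    idx = np.arange(h * w).reshape(h, w)
    kept = []
    for r in range(0, h, cell):
        for c in range(0, w, cell):
            block = idx[r:r + cell, c:c + cell].ravel()
            order = np.argsort(scores[block])[::-1]  # descending by score
            kept.extend(block[order[:keep_per_cell]].tolist())
    return np.array(sorted(kept))


def merge_into_kept(und_tokens, kept):
    """Average each dropped token into its most similar kept token."""
    dropped = np.setdiff1d(np.arange(und_tokens.shape[0]), kept)
    if dropped.size == 0:
        return und_tokens[kept].copy()
    u = _normalize(und_tokens)
    owner = np.argmax(u[dropped] @ u[kept].T, axis=1)  # index into `kept`
    sums = und_tokens[kept].astype(float)  # astype returns a copy
    counts = np.ones(len(kept))
    np.add.at(sums, owner, und_tokens[dropped])  # unbuffered scatter-add
    np.add.at(counts, owner, 1.0)
    return sums / counts[:, None]


def reduce_tokens(und_tokens, vae_latents, grid_hw, cell=2, keep_per_cell=2):
    """Run the three stages and return (reduced tokens, kept indices)."""
    scores = importance_scores(und_tokens, vae_latents)
    kept = balanced_select(scores, grid_hw, cell, keep_per_cell)
    return merge_into_kept(und_tokens, kept), kept
```

With an 8×8 token grid, `cell=2`, and `keep_per_cell=2`, this sketch keeps 32 of 64 tokens. That 2× reduction is chosen only for illustration; the settings behind the reported 1.94× figure are not given in the text above.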
Problem

Research questions and friction points this paper is trying to address.

visual token reduction
unified multimodal models
separate-encoder
image editing
inference efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual token reduction
generation-guided
unified multimodal models
latent-space reconstruction
plug-and-play efficiency