🤖 AI Summary
This work addresses the significant computational and memory bottlenecks in multimodal large language models caused by the excessive number of visual tokens relative to textual tokens. Existing compression methods rely on fixed heuristic strategies that struggle to generalize across diverse scenarios. To overcome this limitation, we propose QMoP, a novel Query-Guided Mixture-of-Projector framework that integrates three complementary compression pathways—pooling, semantic resampling, and fine-grained pruning—within a unified architecture. Leveraging a Query-Guided Router and a Mixture-of-Experts fusion mechanism, QMoP dynamically weights these strategies, enabling adaptive visual token compression tailored to the input text query. Evaluated on our newly introduced VTCBench benchmark, QMoP substantially outperforms strong baselines, achieving notable reductions in memory consumption, computational cost, and inference latency while preserving model performance.
📝 Abstract
Multimodal large language models suffer from severe computational and memory bottlenecks, as the number of visual tokens far exceeds that of textual tokens. While recent methods employ projector modules to align and compress visual tokens into text-aligned features, they typically depend on fixed heuristics that limit adaptability across diverse scenarios. In this paper, we propose Query-Guided Mixture-of-Projector (QMoP), a novel and flexible framework that adaptively compresses visual tokens via three collaborative branches: (1) a pooling-based branch for coarse-grained global semantics, (2) a resampler branch for extracting high-level semantic representations, and (3) a pruning-based branch for fine-grained token selection that preserves critical visual details. To coordinate these branches adaptively, we introduce the Query-Guided Router (QGR), which dynamically selects and weights the branch outputs based on both the visual input and the textual query. A Mixture-of-Experts-style fusion mechanism aggregates the outputs, harnessing the strengths of each strategy while suppressing noise. To systematically evaluate the effects of visual token compression, we also develop VTCBench, a dedicated benchmark for measuring the information loss it induces. Extensive experiments demonstrate that, despite relying only on fundamental compression modules, QMoP outperforms strong baselines and delivers significant savings in memory, computation, and inference time.
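To make the architecture described above concrete, the following is a minimal NumPy sketch of the three-branch compression plus query-guided fusion. All shapes, the random stand-in gate matrix, and the function names (`pool_branch`, `resample_branch`, `prune_branch`, `qmop`) are hypothetical illustrations, not the paper's actual implementation; in the real model the gate and resampler queries would be learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pool_branch(v, k):
    # coarse global semantics: average-pool N visual tokens into k groups
    groups = np.array_split(np.arange(v.shape[0]), k)
    return np.stack([v[g].mean(axis=0) for g in groups])

def resample_branch(v, learned_q):
    # high-level semantics: k learnable queries cross-attend over all tokens
    attn = softmax(learned_q @ v.T / np.sqrt(v.shape[1]), axis=-1)
    return attn @ v

def prune_branch(v, scores, k):
    # fine-grained detail: keep only the k highest-scoring visual tokens
    keep = np.sort(np.argsort(scores)[-k:])
    return v[keep]

def qmop(v, text_q, learned_q, prune_scores, k):
    # router input: pooled visual features concatenated with the text query
    gate_in = np.concatenate([v.mean(axis=0), text_q])
    W_gate = rng.normal(size=(gate_in.shape[0], 3)) * 0.01  # stand-in for a learned gate
    w = softmax(gate_in @ W_gate)  # one weight per branch
    branches = [
        pool_branch(v, k),
        resample_branch(v, learned_q),
        prune_branch(v, prune_scores, k),
    ]
    # MoE-style fusion: weighted sum of the three k-by-d branch outputs
    return sum(wi * b for wi, b in zip(w, branches))

# compress 64 visual tokens (dim 32) down to 8 fused tokens
v = rng.normal(size=(64, 32))
out = qmop(v, rng.normal(size=32), rng.normal(size=(8, 32)),
           rng.normal(size=64), k=8)
```

The key property this sketch illustrates is that all three branches emit the same output shape (`k × d`), so the router can blend them with a simple convex combination; which branch dominates depends on the query-conditioned gate weights.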