HybridToken-VLM: Hybrid Token Compression for Vision-Language Models

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) suffer from computational and memory bottlenecks as the number of visual tokens grows, and conventional compression methods struggle to simultaneously preserve high-level semantics and reconstruct fine-grained details. This paper proposes HybridToken-VLM, the first VLM architecture featuring a semantic-appearance dual-channel hybrid representation: it decouples, then fuses, continuous ViT patch embeddings and discrete MGVQ-quantized anchor tokens. A decoupled attention masking mechanism and a bottleneck design compress the entire hybrid sequence into a single visual token. Even at an extreme 580:1 compression ratio, HybridToken-VLM retains strong semantic guidance. Experiments across seven mainstream benchmarks show that it preserves, on average, 87.2% of the original model's performance, significantly outperforming the best continuous-compression baseline (81.0%) and effectively breaking the efficiency-fidelity trade-off.
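The dual-channel design described above can be sketched roughly as follows: 576 continuous ViT patch embeddings plus 4 discrete anchor tokens give the 580-token hybrid sequence (576 + 4 = 580). This is a minimal illustration only; the module names and dimensions are placeholders, and a plain nearest-neighbour vector-quantization lookup stands in for MGVQ, so it is not the paper's implementation.

```python
import torch
import torch.nn as nn

class HybridVisualTokenizer(nn.Module):
    """Minimal sketch of the dual-channel hybrid sequence: 576 continuous ViT
    patch embeddings + 4 discrete anchor tokens -> 580 hybrid tokens.
    Dimensions, names, and the plain nearest-neighbour VQ lookup (standing in
    for MGVQ) are illustrative assumptions, not the paper's code."""

    def __init__(self, vit_dim=1024, llm_dim=4096, codebook_size=8192, n_anchors=4):
        super().__init__()
        self.patch_proj = nn.Linear(vit_dim, llm_dim)          # continuous pathway
        self.codebook = nn.Embedding(codebook_size, llm_dim)   # discrete pathway
        self.anchor_query = nn.Linear(vit_dim, n_anchors * llm_dim)
        self.n_anchors, self.llm_dim = n_anchors, llm_dim

    def forward(self, vit_patches):                  # (B, 576, vit_dim)
        B = vit_patches.size(0)
        cont = self.patch_proj(vit_patches)          # (B, 576, llm_dim)

        # Pool the image into n_anchors query vectors and snap each one to its
        # nearest codebook entry (straight-through estimator keeps it trainable).
        q = self.anchor_query(vit_patches.mean(dim=1)).view(B, self.n_anchors, self.llm_dim)
        flat = q.reshape(-1, self.llm_dim)                        # (B*4, llm_dim)
        dist = torch.cdist(flat, self.codebook.weight)            # (B*4, codebook_size)
        idx = dist.argmin(dim=-1).view(B, self.n_anchors)         # (B, 4) discrete anchor ids
        disc = self.codebook(idx)                                 # (B, 4, llm_dim)
        disc = q + (disc - q).detach()               # straight-through gradient

        return torch.cat([cont, disc], dim=1)        # (B, 580, llm_dim) hybrid sequence


if __name__ == "__main__":
    tok = HybridVisualTokenizer()
    hybrid = tok(torch.randn(2, 576, 1024))
    print(hybrid.shape)  # torch.Size([2, 580, 4096])
```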

📝 Abstract
Vision-language models (VLMs) have transformed multimodal reasoning, but feeding hundreds of visual patch tokens into LLMs incurs quadratic computational costs, straining memory and context windows. Traditional approaches face a trade-off: continuous compression dilutes high-level semantics such as object identities, while discrete quantization loses fine-grained details such as textures. We introduce HTC-VLM, a hybrid framework that disentangles semantics and appearance through dual channels, i.e., a continuous pathway for fine-grained details via ViT patches and a discrete pathway for symbolic anchors using MGVQ quantization projected to four tokens. These are fused into a 580-token hybrid sequence and compressed into a single voco token via a disentanglement attention mask and bottleneck, ensuring efficient and grounded representations. HTC-VLM achieves an average performance retention of 87.2 percent across seven benchmarks (GQA, VQAv2, MMBench, MME, POPE, SEED-Bench, ScienceQA-Image), outperforming the leading continuous baseline at 81.0 percent with a 580-to-1 compression ratio. Attention analyses show that the compressed token prioritizes the discrete anchor, validating its semantic guidance. Our work demonstrates that a minimalist hybrid design can resolve the efficiency-fidelity dilemma and advance scalable VLMs.
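A rough sketch of how the 580-token hybrid sequence might be compressed into a single token. The block-wise mask below is one plausible reading of the disentanglement attention mask (the abstract does not spell out the exact rule), and the bottleneck is reduced to a single masked self-attention layer with a learnable query; all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

def disentanglement_mask(n_cont=576, n_disc=4, n_voco=1):
    """One plausible reading of the disentanglement attention mask (an assumption,
    not the paper's exact rule): continuous patch tokens and discrete anchor tokens
    attend only within their own channel, while the trailing voco token attends to
    everything and acts as the compression bottleneck."""
    n = n_cont + n_disc + n_voco
    mask = torch.zeros(n, n, dtype=torch.bool)        # True = attention blocked
    mask[:n_cont, n_cont:] = True                      # continuous -> discrete/voco blocked
    mask[n_cont:n_cont + n_disc, :n_cont] = True       # discrete -> continuous blocked
    mask[n_cont:n_cont + n_disc, -n_voco:] = True      # discrete -> voco blocked
    return mask


class VocoBottleneck(nn.Module):
    """Appends a learnable voco token, runs masked self-attention, and keeps only
    the voco position: 580 hybrid tokens -> 1 compressed visual token (580:1)."""

    def __init__(self, dim=4096, n_heads=8):
        super().__init__()
        self.voco = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, hybrid_tokens):                  # (B, 580, dim)
        B = hybrid_tokens.size(0)
        x = torch.cat([hybrid_tokens, self.voco.expand(B, -1, -1)], dim=1)
        mask = disentanglement_mask().to(x.device)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out[:, -1:, :]                          # (B, 1, dim) single voco token


if __name__ == "__main__":
    voco = VocoBottleneck()(torch.randn(2, 580, 4096))
    print(voco.shape)  # torch.Size([2, 1, 4096])
```

Under this reading, the voco position is the only one that can read both channels, which is consistent with the abstract's attention analysis showing that the compressed token prioritizes the discrete anchor while still having access to the continuous detail tokens.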
Problem

Research questions and friction points this paper is trying to address.

High visual token counts inflate computational and memory costs in VLMs (see the cost sketch after this list)
Semantic retention and fine-grained detail preservation are hard to achieve at the same time
The efficiency-fidelity trade-off limits scalable multimodal reasoning
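To make the first friction point concrete, a back-of-envelope calculation (the 100-token text prompt is an assumed figure, not from the paper) shows how the quadratic self-attention term shrinks when the visual prefix collapses from 580 tokens to one:

```python
# Illustrative arithmetic only: self-attention cost grows with the square of the
# sequence length, so shrinking the visual prefix from 580 tokens to 1 shrinks
# the quadratic term sharply.
def attn_pairs(n_visual, n_text=100):
    n = n_visual + n_text
    return n * n  # pairwise attention interactions per layer

full = attn_pairs(580)      # 580 hybrid visual tokens + text
compressed = attn_pairs(1)  # single voco token + text
print(full, compressed, round(full / compressed, 1))
# 462400 10201 45.3  -> roughly 45x fewer attention interactions per layer
```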
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid framework with dual channels for semantics and appearance
Compresses 580 tokens into one via disentanglement attention mask
Achieves 87.2% performance retention across seven benchmarks
Authors
Jusheng Zhang, Sun Yat-sen University
Xiaoyang Guo, Florida State University (Statistical Shape Analysis, Graph, Computer Vision, Machine Learning)
Kaitong Cai, Sun Yat-sen University
Qinhan Lv, Sun Yat-sen University
Yijia Fan, Sun Yat-sen University
Wenhao Chai, Princeton University (Machine Learning, Computer Vision)
Jian Wang, Snap Inc.
Keze Wang, Sun Yat-sen University