HybridToken-VLM: Hybrid Token Compression for Vision-Language Models

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) suffer from computational and memory bottlenecks as the number of visual tokens grows, and conventional compression methods struggle to simultaneously preserve high-level semantics and reconstruct fine-grained details. This paper proposes HybridToken-VLM, the first VLM architecture featuring a semantic-appearance dual-channel hybrid representation: it decouples, then fuses, continuous ViT patch embeddings and discrete MGVQ-quantized anchor tokens. A decoupled attention masking mechanism and a bottleneck design compress the entire hybrid sequence into a single visual token. Even at an extreme 580:1 compression ratio, HybridToken-VLM retains strong semantic guidance. Experiments across seven mainstream benchmarks show that it preserves, on average, 87.2% of the original model's performance, significantly outperforming the best continuous-compression baseline (81.0%) and effectively breaking the efficiency-fidelity trade-off.
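The dual-channel design described above can be sketched roughly as follows: 576 continuous ViT patch embeddings plus 4 discrete anchor tokens give the 580-token hybrid sequence (576 + 4 = 580). This is a minimal illustration only; the module names and dimensions are placeholders, and a plain nearest-neighbour vector-quantization lookup stands in for MGVQ, so it is not the paper's implementation.

```python
import torch
import torch.nn as nn

class HybridVisualTokenizer(nn.Module):
    """Minimal sketch of the dual-channel hybrid sequence: 576 continuous ViT
    patch embeddings + 4 discrete anchor tokens -> 580 hybrid tokens.
    Dimensions, names, and the plain nearest-neighbour VQ lookup (standing in
    for MGVQ) are illustrative assumptions, not the paper's code."""

    def __init__(self, vit_dim=1024, llm_dim=4096, codebook_size=8192, n_anchors=4):
        super().__init__()
        self.patch_proj = nn.Linear(vit_dim, llm_dim)          # continuous pathway
        self.codebook = nn.Embedding(codebook_size, llm_dim)   # discrete pathway
        self.anchor_query = nn.Linear(vit_dim, n_anchors * llm_dim)
        self.n_anchors, self.llm_dim = n_anchors, llm_dim

    def forward(self, vit_patches):                  # (B, 576, vit_dim)
        B = vit_patches.size(0)
        cont = self.patch_proj(vit_patches)          # (B, 576, llm_dim)

        # Pool the image into n_anchors query vectors and snap each one to its
        # nearest codebook entry (straight-through estimator keeps it trainable).
        q = self.anchor_query(vit_patches.mean(dim=1)).view(B, self.n_anchors, self.llm_dim)
        flat = q.reshape(-1, self.llm_dim)                        # (B*4, llm_dim)
        dist = torch.cdist(flat, self.codebook.weight)            # (B*4, codebook_size)
        idx = dist.argmin(dim=-1).view(B, self.n_anchors)         # (B, 4) discrete anchor ids
        disc = self.codebook(idx)                                 # (B, 4, llm_dim)
        disc = q + (disc - q).detach()               # straight-through gradient

        return torch.cat([cont, disc], dim=1)        # (B, 580, llm_dim) hybrid sequence


if __name__ == "__main__":
    tok = HybridVisualTokenizer()
    hybrid = tok(torch.randn(2, 576, 1024))
    print(hybrid.shape)  # torch.Size([2, 580, 4096])
```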

📝 Abstract
Vision-language models (VLMs) have transformed multimodal reasoning, but feeding hundreds of visual patch tokens into LLMs incurs quadratic computational costs, straining memory and context windows. Traditional approaches face a trade-off: continuous compression dilutes high-level semantics such as object identities, while discrete quantization loses fine-grained details such as textures. We introduce HTC-VLM, a hybrid framework that disentangles semantics and appearance through dual channels, i.e., a continuous pathway for fine-grained details via ViT patches and a discrete pathway for symbolic anchors using MGVQ quantization projected to four tokens. These are fused into a 580-token hybrid sequence and compressed into a single voco token via a disentanglement attention mask and bottleneck, ensuring efficient and grounded representations. HTC-VLM achieves an average performance retention of 87.2 percent across seven benchmarks (GQA, VQAv2, MMBench, MME, POPE, SEED-Bench, ScienceQA-Image), outperforming the leading continuous baseline at 81.0 percent with a 580-to-1 compression ratio. Attention analyses show that the compressed token prioritizes the discrete anchor, validating its semantic guidance. Our work demonstrates that a minimalist hybrid design can resolve the efficiency-fidelity dilemma and advance scalable VLMs.
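A rough sketch of how the 580-token hybrid sequence might be compressed into a single token. The block-wise mask below is one plausible reading of the disentanglement attention mask (the abstract does not spell out the exact rule), and the bottleneck is reduced to a single masked self-attention layer with a learnable query; all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

def disentanglement_mask(n_cont=576, n_disc=4, n_voco=1):
    """One plausible reading of the disentanglement attention mask (an assumption,
    not the paper's exact rule): continuous patch tokens and discrete anchor tokens
    attend only within their own channel, while the trailing voco token attends to
    everything and acts as the compression bottleneck."""
    n = n_cont + n_disc + n_voco
    mask = torch.zeros(n, n, dtype=torch.bool)        # True = attention blocked
    mask[:n_cont, n_cont:] = True                      # continuous -> discrete/voco blocked
    mask[n_cont:n_cont + n_disc, :n_cont] = True       # discrete -> continuous blocked
    mask[n_cont:n_cont + n_disc, -n_voco:] = True      # discrete -> voco blocked
    return mask


class VocoBottleneck(nn.Module):
    """Appends a learnable voco token, runs masked self-attention, and keeps only
    the voco position: 580 hybrid tokens -> 1 compressed visual token (580:1)."""

    def __init__(self, dim=4096, n_heads=8):
        super().__init__()
        self.voco = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, hybrid_tokens):                  # (B, 580, dim)
        B = hybrid_tokens.size(0)
        x = torch.cat([hybrid_tokens, self.voco.expand(B, -1, -1)], dim=1)
        mask = disentanglement_mask().to(x.device)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out[:, -1:, :]                          # (B, 1, dim) single voco token


if __name__ == "__main__":
    voco = VocoBottleneck()(torch.randn(2, 580, 4096))
    print(voco.shape)  # torch.Size([2, 1, 4096])
```

Under this reading, the voco position is the only one that can read both channels, which is consistent with the abstract's attention analysis showing that the compressed token prioritizes the discrete anchor while still having access to the continuous detail tokens.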
Problem

Research questions and friction points this paper is trying to address.

High visual token counts inflate computational and memory costs in VLMs (see the cost sketch after this list)
Semantic retention and fine-grained detail preservation are hard to achieve at the same time
The efficiency-fidelity trade-off limits scalable multimodal reasoning
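To make the first friction point concrete, a back-of-envelope calculation (the 100-token text prompt is an assumed figure, not from the paper) shows how the quadratic self-attention term shrinks when the visual prefix collapses from 580 tokens to one:

```python
# Illustrative arithmetic only: self-attention cost grows with the square of the
# sequence length, so shrinking the visual prefix from 580 tokens to 1 shrinks
# the quadratic term sharply.
def attn_pairs(n_visual, n_text=100):
    n = n_visual + n_text
    return n * n  # pairwise attention interactions per layer

full = attn_pairs(580)      # 580 hybrid visual tokens + text
compressed = attn_pairs(1)  # single voco token + text
print(full, compressed, round(full / compressed, 1))
# 462400 10201 45.3  -> roughly 45x fewer attention interactions per layer
```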
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid framework with dual channels for semantics and appearance
Compresses 580 tokens into one via disentanglement attention mask
Achieves 87.2% performance retention across seven benchmarks
Authors
Jusheng Zhang, Sun Yat-sen University
Xiaoyang Guo, Florida State University (Statistical Shape Analysis, Graph, Computer Vision, Machine Learning)
Kaitong Cai, Sun Yat-sen University
Qinhan Lv, Sun Yat-sen University
Yijia Fan, Sun Yat-sen University
Wenhao Chai, Princeton University (Machine Learning, Computer Vision)
Jian Wang, Snap Inc.
Keze Wang, Sun Yat-sen University