VLCache: Computing 2% Vision Tokens and Reusing 98% for Vision-Language Inference

📅 2025-12-14
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To eliminate redundant computation caused by repeated visual inputs in multimodal large language model (MLLM) inference, this paper proposes VLCache, a cache reuse framework that, for the first time, jointly reuses both visual encoder outputs and LLM key-value (KV) caches. The authors formally model cumulative cache reuse error and design a layer-aware dynamic recomputation strategy that trades accuracy against efficiency. The method matches full-recomputation accuracy while recomputing only 2-5% of visual tokens, reducing time-to-first-token (TTFT) by 1.2x-16x across diverse workloads. VLCache has been integrated into the SGLang inference system and deployed in production, establishing an efficient, high-fidelity caching optimization paradigm for multimodal inference.

πŸ“ Abstract
This paper presents VLCache, a cache reuse framework that exploits both the Key-Value (KV) cache and the encoder cache from prior multimodal inputs to eliminate costly recomputation when the same multimodal inputs recur. Unlike previous heuristic approaches, we formally identify the cumulative reuse error effect and demonstrate how to effectively minimize the non-prefix cache reuse error. We further analyze the varying importance of model layers and propose a dynamic, layer-aware recomputation strategy to balance accuracy and efficiency. Experimental results show that VLCache achieves accuracy on par with full recomputation while computing only 2-5% of the tokens, yielding 1.2x-16x TTFT speedups. The proposed VLCache pipeline has been integrated into SGLang, enabling significantly faster inference in practical deployments.
Problem

Research questions and friction points this paper is trying to address.

Reduces recomputation cost for recurring multimodal inputs
Minimizes cumulative error in non-prefix cache reuse
Balances accuracy and efficiency with dynamic layer-aware strategy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint KV and encoder cache reuse eliminates recomputation for recurring multimodal inputs
Layer-aware recomputation balances accuracy and efficiency
Computes only 2-5% of tokens for 1.2x-16x speedups
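The layer-aware recomputation idea above can be sketched in a few lines: given an estimate of each cached token's reuse error at a layer and that layer's importance weight, recompute only the small fraction of tokens whose cached values deviate most. This is a minimal illustrative sketch, assuming a simple top-k selection; the function name, per-token error estimates, and weighting scheme are assumptions, not the paper's actual algorithm.

```python
def select_recompute_tokens(errors, layer_weight, budget=0.05):
    """Pick which cached tokens to recompute at one layer.

    errors:       per-token estimated cache-reuse error (hypothetical inputs)
    layer_weight: relative importance of this layer; more important layers
                  get a larger share of the recomputation budget
    budget:       global fraction of tokens allowed to be recomputed (e.g. 2-5%)
    Returns the set of token indices to recompute.
    """
    n = len(errors)
    # scale the global budget by layer importance; always recompute at least one token
    k = max(1, round(n * budget * layer_weight))
    # rank token indices by reuse error, largest first, and keep the top k
    ranked = sorted(range(n), key=lambda i: errors[i], reverse=True)
    return set(ranked[:k])


# Usage: 10 tokens, 10% global budget, an important layer weighted 2x
errs = [0.01, 0.9, 0.05, 0.7, 0.02, 0.03, 0.8, 0.04, 0.02, 0.01]
chosen = select_recompute_tokens(errs, layer_weight=2.0, budget=0.1)
# selects the indices of the two largest errors (0.9 and 0.8)
```

All other tokens reuse their cached KV entries; only the selected indices are fed back through the layer, which is how a 2-5% compute fraction can still track full-recomputation accuracy.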
Shengling Qin
Qwen Team, Alibaba Inc.
Hao Yu
Qwen Team, Alibaba Inc.
Chenxin Wu
TairKVCache Team, Alibaba Cloud
Zheng Li
Qwen Team, Alibaba Inc.
Yizhong Cao
Qwen Team, Alibaba Inc.
Zhengyang Zhuge
Qwen Team, Alibaba Inc.
Yuxin Zhou
University of California, Riverside
Wentao Yao
Qwen Team, Alibaba Inc.
Yi Zhang
Qwen Team, Alibaba Inc.
Zhengheng Wang
TairKVCache Team, Alibaba Cloud
Shuai Bai
Qwen Team, Alibaba Group
Jianwei Zhang
Qwen Team, Alibaba Inc.
Junyang Lin
Qwen Team, Alibaba Group & Peking University