VLCache: Computing 2% Vision Tokens and Reusing 98% for Vision-Language Inference

📅 2025-12-14
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To eliminate redundant computation caused by repeated visual inputs in multimodal large language model (MLLM) inference, this paper proposes VLCache, a cache reuse framework that, for the first time, jointly reuses both visual encoder outputs and LLM key-value (KV) caches. The authors formally model cumulative cache reuse error and design a layer-aware dynamic recomputation strategy that trades accuracy against efficiency. The method matches full-recomputation accuracy while recomputing only 2-5% of visual tokens, reducing time-to-first-token (TTFT) by 1.2x-16x across diverse workloads. VLCache has been integrated into the SGLang inference system and deployed in production, establishing an efficient, high-fidelity caching optimization paradigm for multimodal inference.

πŸ“ Abstract
This paper presents VLCache, a cache reuse framework that exploits both the Key-Value (KV) cache and the encoder cache from prior multimodal inputs to eliminate costly recomputation when the same multimodal inputs recur. Unlike previous heuristic approaches, we formally identify the cumulative reuse error effect and demonstrate how to effectively minimize the non-prefix cache reuse error. We further analyze the varying importance of model layers and propose a dynamic, layer-aware recomputation strategy to balance accuracy and efficiency. Experimental results show that VLCache achieves accuracy on par with full recomputation while computing only 2-5% of the tokens, yielding 1.2x-16x TTFT speedups. The proposed VLCache pipeline has been integrated into SGLang, enabling significantly faster inference in practical deployments.
Problem

Research questions and friction points this paper is trying to address.

Reduces recomputation cost for recurring multimodal inputs
Minimizes cumulative error in non-prefix cache reuse
Balances accuracy and efficiency with dynamic layer-aware strategy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint KV and encoder cache reuse eliminates recomputation for recurring multimodal inputs
Layer-aware recomputation balances accuracy and efficiency
Computes only 2-5% of tokens for 1.2x-16x speedups
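The layer-aware recomputation idea above can be sketched in a few lines: given an estimate of each cached token's reuse error at a layer and that layer's importance weight, recompute only the small fraction of tokens whose cached values deviate most. This is a minimal illustrative sketch, assuming a simple top-k selection; the function name, per-token error estimates, and weighting scheme are assumptions, not the paper's actual algorithm.

```python
def select_recompute_tokens(errors, layer_weight, budget=0.05):
    """Pick which cached tokens to recompute at one layer.

    errors:       per-token estimated cache-reuse error (hypothetical inputs)
    layer_weight: relative importance of this layer; more important layers
                  get a larger share of the recomputation budget
    budget:       global fraction of tokens allowed to be recomputed (e.g. 2-5%)
    Returns the set of token indices to recompute.
    """
    n = len(errors)
    # scale the global budget by layer importance; always recompute at least one token
    k = max(1, round(n * budget * layer_weight))
    # rank token indices by reuse error, largest first, and keep the top k
    ranked = sorted(range(n), key=lambda i: errors[i], reverse=True)
    return set(ranked[:k])


# Usage: 10 tokens, 10% global budget, an important layer weighted 2x
errs = [0.01, 0.9, 0.05, 0.7, 0.02, 0.03, 0.8, 0.04, 0.02, 0.01]
chosen = select_recompute_tokens(errs, layer_weight=2.0, budget=0.1)
# selects the indices of the two largest errors (0.9 and 0.8)
```

All other tokens reuse their cached KV entries; only the selected indices are fed back through the layer, which is how a 2-5% compute fraction can still track full-recomputation accuracy.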
Shengling Qin
Qwen Team, Alibaba Inc.
Hao Yu
Qwen Team, Alibaba Inc.
Chenxin Wu
TairKVCache Team, Alibaba Cloud
Zheng Li
Qwen Team, Alibaba Inc.
Yizhong Cao
Qwen Team, Alibaba Inc.
Zhengyang Zhuge
Qwen Team, Alibaba Inc.
Yuxin Zhou
University of California, Riverside
Wentao Yao
Qwen Team, Alibaba Inc.
Yi Zhang
Qwen Team, Alibaba Inc.
Zhengheng Wang
TairKVCache Team, Alibaba Cloud
Shuai Bai
Qwen Team, Alibaba Group
Jianwei Zhang
Qwen Team, Alibaba Inc.
Junyang Lin
Qwen Team, Alibaba Group & Peking University