SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference

📅 2025-10-20
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Vision Language Models (VLMs) suffer from high inference latency and poor scalability when processing high-resolution images or long videos, because the growing number of visual tokens comes to dominate inference cost. This work proposes SparseVILA, a framework that decouples visual sparsity across the prefilling and decoding stages. During prefilling, it prunes redundant visual tokens with a query-agnostic criterion; during decoding, it dynamically retrieves query-relevant visual tokens via attention and retains most of the cached visual representations to support multi-turn dialogue. The method requires no fine-tuning, is architecture-agnostic across mainstream VLMs, and is built on an AWQ-optimized inference pipeline. Experiments demonstrate an end-to-end speedup of 2.6× on long-video understanding (4.0× in prefilling, 2.5× in decoding), while preserving or even improving accuracy on document-understanding and complex-reasoning benchmarks.
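The decoding-stage retrieval described above can be illustrated with a minimal NumPy sketch: each cached visual token is scored by its mean attention weight from the current query tokens, and only the top-scoring fraction is kept. The shapes, the mean-over-queries aggregation, and the `keep_ratio` parameter are illustrative assumptions, not SparseVILA's exact formulation.

```python
import numpy as np

def retrieve_visual_tokens(query_states, visual_keys, keep_ratio=0.25):
    """Query-aware retrieval sketch (hypothetical scoring rule).

    query_states: (n_query, d)  hidden states of the text-query tokens
    visual_keys:  (n_visual, d) cached key vectors of the visual tokens
    Returns the indices of the retained visual tokens, in original order.
    """
    d = query_states.shape[-1]
    # Scaled dot-product attention logits, as in standard attention.
    logits = query_states @ visual_keys.T / np.sqrt(d)   # (n_query, n_visual)
    # Softmax over visual tokens for each query token.
    logits -= logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Aggregate each visual token's relevance across all query tokens.
    scores = weights.mean(axis=0)                        # (n_visual,)
    k = max(1, int(keep_ratio * visual_keys.shape[0]))
    return np.sort(np.argsort(scores)[-k:])

rng = np.random.default_rng(0)
kept = retrieve_visual_tokens(rng.normal(size=(8, 64)),
                              rng.normal(size=(100, 64)))
print(len(kept))  # 25 of 100 visual tokens retained
```

Because the full visual cache is retained, a new retrieval pass with the next turn's query can select a different token subset, which is what preserves multi-turn fidelity.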

๐Ÿ“ Abstract
Vision Language Models (VLMs) have rapidly advanced in integrating visual and textual reasoning, powering applications across high-resolution image understanding, long-video analysis, and multi-turn conversation. However, their scalability remains limited by the growing number of visual tokens that dominate inference latency. We present SparseVILA, a new paradigm for efficient VLM inference that decouples visual sparsity across the prefilling and decoding stages. SparseVILA distributes sparsity across stages by pruning redundant visual tokens during prefill and retrieving only query-relevant tokens during decoding. This decoupled design matches leading prefill pruning methods while preserving multi-turn fidelity by retaining most of the visual cache so that query-aware tokens can be retrieved at each conversation round. Built on an AWQ-optimized inference pipeline, SparseVILA achieves up to 4.0× faster prefilling, 2.5× faster decoding, and an overall 2.6× end-to-end speedup on long-context video tasks, while improving accuracy on document-understanding and reasoning tasks. By decoupling query-agnostic pruning and query-aware retrieval, SparseVILA establishes a new direction for efficient multimodal inference, offering a training-free, architecture-agnostic framework for accelerating large VLMs without sacrificing capability.
Problem

Research questions and friction points this paper is trying to address.

Reduces visual token redundancy to accelerate VLM inference
Decouples sparsity management between prefilling and decoding stages
Maintains accuracy while accelerating multimodal inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples visual sparsity across prefilling and decoding stages
Prunes redundant tokens during prefill for faster processing
Retrieves query-relevant tokens during decoding for accuracy
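The prefill-side pruning in the bullets above is query-agnostic, so it can only rely on the visual tokens themselves. One simple stand-in criterion, sketched below, greedily drops tokens whose cosine similarity to an already-kept token exceeds a threshold; the threshold and the greedy rule are illustrative assumptions, not the paper's actual pruning criterion.

```python
import numpy as np

def prune_redundant_tokens(visual_tokens, sim_threshold=0.9):
    """Query-agnostic pruning sketch (hypothetical redundancy rule).

    visual_tokens: (n, d) visual token embeddings from the encoder.
    Keeps a token only if it is sufficiently dissimilar (cosine < threshold)
    to every token already kept; returns kept indices in original order.
    """
    normed = visual_tokens / np.linalg.norm(visual_tokens, axis=-1,
                                            keepdims=True)
    kept = []
    for i, tok in enumerate(normed):
        if all(tok @ normed[j] < sim_threshold for j in kept):
            kept.append(i)
    return np.array(kept)

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 32))
# Append near-duplicates of the base tokens; these should be pruned away.
tokens = np.vstack([base, base + 1e-3 * rng.normal(size=(4, 32))])
kept = prune_redundant_tokens(tokens)
print(kept)  # only the 4 original tokens survive
```

Because this step never looks at the text query, its output can be computed once per image or video and reused across conversation turns.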