Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects

📅 2026-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the efficiency bottlenecks of large vision-language model inference, which are dominated by high-resolution visual features, quadratic-complexity attention, and memory bandwidth constraints, together forming a “vision-token-dominated” bottleneck. The study proposes the first end-to-end optimization framework that decouples inference into three distinct stages: encoding, prefilling, and decoding. It systematically analyzes stage-specific bottlenecks and their inter-stage couplings, advocating a balanced trade-off between visual fidelity and computational efficiency through strategies such as information density modulation, long-context attention management, and memory-bound mitigation. Key contributions include a structured taxonomy of optimization techniques, an open-source and actively maintained literature repository, insights into the potential of combining optimizations, and a roadmap of four promising directions: function-aware hybrid compression, modality-aware decoding, streaming state management, and hardware-software co-designed staged serving.
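The “vision-token-dominated” bottleneck can be made concrete with a back-of-envelope sketch. All numbers below (patch size 14, a 336×336 base tile, LLaVA-NeXT-style high-resolution tiling, a 100-token text prompt) are common defaults assumed for illustration; they are not taken from the paper:

```python
def visual_tokens(image_px: int, tile_px: int = 336, patch_px: int = 14) -> int:
    """Visual tokens produced for a square image under high-resolution tiling."""
    tiles = (image_px // tile_px) ** 2            # e.g. a 1344px image -> 4x4 = 16 tiles
    tokens_per_tile = (tile_px // patch_px) ** 2  # 24 x 24 = 576 tokens per tile
    return tiles * tokens_per_tile

text_tokens = 100                 # a typical user prompt (assumed)
img_tokens = visual_tokens(1344)  # 16 tiles * 576 = 9216 tokens for one image

# Self-attention cost in prefilling grows quadratically with sequence length,
# so one high-resolution image inflates attention work by orders of magnitude:
total = text_tokens + img_tokens
blowup = total ** 2 / text_tokens ** 2
print(img_tokens, round(blowup))  # visual tokens outnumber text ~92:1
```

Under these assumed defaults, visual tokens make up over 98% of the sequence, which is why the survey treats the encoding and prefilling stages as the primary targets for token compression.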
📝 Abstract
Large Vision-Language Models (LVLMs) enable sophisticated reasoning over images and videos, yet their inference is hindered by a systemic efficiency barrier known as visual token dominance. This overhead is driven by a multi-regime interplay between high-resolution feature extraction, quadratic attention scaling, and memory bandwidth constraints. We present a systematic taxonomy of efficiency techniques structured around the inference lifecycle, consisting of encoding, prefilling, and decoding. Unlike prior reviews focused on isolated optimizations, we analyze the end-to-end pipeline to reveal how upstream decisions dictate downstream bottlenecks, covering compute-bound visual encoding, the intensive prefilling of massive contexts, and the “visual memory wall” in bandwidth-bound decoding. By decoupling the efficiency landscape into the axes of shaping information density, managing long-context attention, and overcoming memory limits, this work provides a structured analysis of how isolated optimizations compose to navigate the trade-off between visual fidelity and system efficiency. The survey concludes by outlining four future frontiers supported by pilot empirical insights, including hybrid compression based on functional unit sensitivity, modality-aware decoding with relaxed verification, progressive state management for streaming continuity, and stage-disaggregated serving through hardware-algorithm co-design. The submitted software contains a snapshot of our literature repository, which is designed to be maintained as a living resource for the community.
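The abstract's distinction between compute-bound prefilling and bandwidth-bound decoding (the “visual memory wall”) comes down to the KV cache: each decoded token must read the entire cache, so per-token latency scales with cache size, which visual tokens inflate. A hedged sketch with illustrative numbers (a Llama-style 32-layer model with grouped-query attention, fp16, ~2 TB/s HBM bandwidth; none of these figures are from the survey):

```python
def kv_cache_bytes(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Total KV cache size: K and V tensors per layer, fp16 precision."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

HBM_BANDWIDTH = 2e12            # ~2 TB/s, an A100-class GPU (assumed)

text_ctx = 1_000                # text-only context length (assumed)
vis_ctx = text_ctx + 9_216      # same context plus one high-res image's tokens

# Decoding is memory-bound: the floor on per-token latency is the time
# to stream the whole KV cache through HBM once per generated token.
t_text = kv_cache_bytes(text_ctx) / HBM_BANDWIDTH
t_vis = kv_cache_bytes(vis_ctx) / HBM_BANDWIDTH
print(f"{t_vis / t_text:.1f}x more KV traffic per decoded token")
```

Because this cost is paid on every decoded token, cache-side techniques (visual token pruning, KV compression, progressive state management) attack decoding latency directly, whereas FLOP reductions alone do not.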
Problem

Research questions and friction points this paper is trying to address.

Large Vision-Language Models
visual token dominance
inference efficiency
memory bandwidth constraints
quadratic attention scaling
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual token dominance
inference lifecycle
memory bandwidth constraints
long-context attention
hardware-algorithm co-design
🔎 Similar Papers
2024-03-04 · Computer Vision and Pattern Recognition · Citations: 3
Jun Zhang
Bosch Security Systems B.V.
Computer Vision, Machine Learning, Image Processing
Yicheng Ji
The State Key Laboratory of Blockchain and Data Security, Zhejiang University, Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
Feiyang Ren
The State Key Laboratory of Blockchain and Data Security, Zhejiang University, Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
Yihang Li
The State Key Laboratory of Blockchain and Data Security, Zhejiang University, Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
Bowen Zeng
The State Key Laboratory of Blockchain and Data Security, Zhejiang University, Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
Zonghao Chen
The State Key Laboratory of Blockchain and Data Security, Zhejiang University, Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
Ke Chen
Associate Professor of Computer Science, Zhejiang University
database system
Lidan Shou
Professor of Computer Science, Zhejiang University
Database, Data & Knowledge Management, ML Systems
Gang Chen
Florida State University
Environmental Engineering
Huan Li
ZJU100 Young Professor
AI Data Preparation, Efficient AI, Spatiotemporal Data