🤖 AI Summary
This work addresses the efficiency bottlenecks in large vision-language model inference, which are dominated by high-resolution visual features, quadratic-complexity attention mechanisms, and memory bandwidth constraints—collectively forming a “vision-token-dominated” bottleneck. The study proposes the first end-to-end optimization framework that decouples inference into three distinct stages: encoding, prefilling, and decoding. It systematically analyzes stage-specific bottlenecks and their inter-stage couplings, advocating a balanced trade-off between visual fidelity and computational efficiency through strategies such as information density modulation, long-context attention management, and mitigation of memory-bound decoding. Key contributions include a structured taxonomy of optimization techniques, a maintained open-source literature repository, insights into combinatorial optimization potential, and a roadmap highlighting four promising directions: function-aware hybrid compression, modality-aware decoding, streaming state management, and hardware-software co-designed staged serving.
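The "information density modulation" idea can be made concrete with a toy sketch (a hypothetical illustration, not any specific method from the survey): rank vision tokens by an importance score (e.g., attention weights from the encoder's CLS token) and keep only a top fraction before prefilling, shrinking the context the language model must attend over.

```python
import numpy as np

def prune_vision_tokens(tokens: np.ndarray, scores: np.ndarray,
                        keep_ratio: float = 0.25) -> np.ndarray:
    """Keep the top `keep_ratio` fraction of vision tokens by importance score.

    tokens: (N, d) vision token embeddings
    scores: (N,) importance scores (e.g., CLS-attention weights)
    Returns kept tokens in their original order, preserving spatial layout.
    """
    n_keep = max(1, int(len(tokens) * keep_ratio))
    keep_idx = np.sort(np.argsort(scores)[-n_keep:])  # top-k, original order
    return tokens[keep_idx]

# 576 patch tokens (a 24x24 grid) of dim 1024, roughly a ViT-L/14 at 336px
rng = np.random.default_rng(0)
tokens = rng.standard_normal((576, 1024))
scores = rng.random(576)
kept = prune_vision_tokens(tokens, scores, keep_ratio=0.25)
print(kept.shape)  # (144, 1024)
```

Because prefill attention cost grows quadratically in sequence length, keeping 25% of the vision tokens cuts the vision-vision attention work by roughly 16x, at the cost of whatever visual fidelity the discarded tokens carried.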
📝 Abstract
Large Vision-Language Models (LVLMs) enable sophisticated reasoning over images and videos, yet their inference is hindered by a systemic efficiency barrier known as visual token dominance. This overhead is driven by a multi-regime interplay between high-resolution feature extraction, quadratic attention scaling, and memory bandwidth constraints. We present a systematic taxonomy of efficiency techniques structured around the inference lifecycle, consisting of encoding, prefilling, and decoding. Unlike prior reviews focused on isolated optimizations, we analyze the end-to-end pipeline to reveal how upstream decisions dictate downstream bottlenecks, covering compute-bound visual encoding, the intensive prefilling of massive contexts, and the “visual memory wall” in bandwidth-bound decoding. By decoupling the efficiency landscape into the axes of shaping information density, managing long-context attention, and overcoming memory limits, this work provides a structured analysis of how isolated optimizations compose to navigate the trade-off between visual fidelity and system efficiency. The survey concludes by outlining four future frontiers supported by pilot empirical insights, including hybrid compression based on functional unit sensitivity, modality-aware decoding with relaxed verification, progressive state management for streaming continuity, and stage-disaggregated serving through hardware-algorithm co-design. The submitted software contains a snapshot of our literature repository, which is designed to be maintained as a living resource for the community.
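Why decoding hits a "visual memory wall" can be seen with a back-of-envelope sketch (illustrative numbers for a hypothetical 7B-class LVLM; the model dimensions and bandwidth figure are assumptions, not values from the survey): each generated token re-reads the entire KV cache, so every vision token that enters the cache adds to per-step memory traffic for the whole generation.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int = 32, n_kv_heads: int = 32,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """KV-cache size: two tensors (K and V) per layer per token, here in fp16."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * n_tokens

vision_tokens = 576   # e.g., a 24x24 ViT patch grid for one image
text_tokens = 64      # a short text prompt, for contrast
vision_mib = kv_cache_bytes(vision_tokens) / 2**20
text_mib = kv_cache_bytes(text_tokens) / 2**20
print(f"vision KV: {vision_mib:.0f} MiB, text KV: {text_mib:.0f} MiB")  # 288 vs 32

# At an assumed ~1 TB/s of HBM bandwidth, each decode step must at minimum
# stream the whole cache, giving a bandwidth-limited lower bound per token:
step_ms = kv_cache_bytes(vision_tokens + text_tokens) / 1e12 * 1e3
print(f"lower-bound KV read time per decoded token: {step_ms:.3f} ms")
```

The arithmetic intensity of this step is low (one matrix-vector product per cache read), which is why decoding is bandwidth-bound rather than compute-bound, and why shrinking or compressing the vision portion of the cache pays off at every subsequent decode step.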