Mosaic: Cross-Modal Clustering for Efficient Video Understanding

📅 2026-04-11
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the growing computational and memory overhead of the KVCache in streaming long-video understanding, where the cache expands continuously as frames arrive. The authors report that the KVCache of vision-language models (VLMs) exhibits an implicit cross-modal clustering structure, and they introduce a cluster-level KVCache management mechanism built on it. Clusters, formed jointly by visual coherence and semantic relevance, serve as the basic unit of cache organization, maintenance, and retrieval; the method exploits attention sparsity and uses GPU-CPU cache offloading. In the reported evaluations, the system achieves up to 1.38× speedup over state-of-the-art baselines.

📝 Abstract
Large vision-language models (VLMs) are enabling interactive video reasoning, giving rise to streaming long-video understanding. In this setting, frames arrive continuously, while the system preserves long-term context and generates responses under strict latency constraints. A central challenge is KVCache management: as video streams grow, KVCache expands rapidly, increasing computation and memory overhead. Existing retrieval-based approaches exploit attention sparsity and offload inactive KVCache from GPU to CPU memory, but their token-level design causes high management overhead and fragmented data movement. We present Mosaic, the first cluster-driven VLM inference system for streaming long-video understanding. Our key insight is that VLM KVCache exhibits an implicit cross-modal clustering structure: retrieved KV states form groups jointly shaped by visual coherence and semantic relevance. Based on this observation, Mosaic uses cross-modal clusters as the basic unit of KVCache organization, maintenance, and retrieval. Evaluations show that Mosaic outperforms state-of-the-art baselines, achieving up to 1.38x speedup.
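
The abstract does not spell out how clusters are built or scored, so the following is only a minimal sketch of the general idea of cluster-level (rather than token-level) KV retrieval. Everything in it is an assumption made for illustration: the `ClusterCache` class, the greedy cosine-threshold assignment, the mean-key centroids, and the `threshold` and `top_k` parameters are hypothetical and are not taken from Mosaic.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class ClusterCache:
    """Toy cluster-level KV index (hypothetical, not Mosaic's actual design)."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold  # assumed similarity cutoff for joining a cluster
        self.centroids = []         # one running-mean key vector per cluster
        self.members = []           # token ids belonging to each cluster

    def add(self, token_id, key):
        # Find the most similar existing cluster at or above the threshold.
        best, best_sim = -1, self.threshold
        for i, centroid in enumerate(self.centroids):
            sim = cosine(key, centroid)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best < 0:
            # No cluster is similar enough: open a new one.
            self.centroids.append(list(key))
            self.members.append([token_id])
            return
        # Update the running mean of the chosen cluster.
        n = len(self.members[best])
        self.centroids[best] = [
            (c * n + k) / (n + 1) for c, k in zip(self.centroids[best], key)
        ]
        self.members[best].append(token_id)

    def retrieve(self, query, top_k=1):
        # Score whole clusters by query-centroid dot product and return
        # the member token ids of the top_k clusters as contiguous groups.
        scores = [
            sum(q * c for q, c in zip(query, centroid))
            for centroid in self.centroids
        ]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return [self.members[i] for i in ranked[:top_k]]


# Four toy 2-D keys: two near the x-axis, two near the y-axis.
cache = ClusterCache(threshold=0.9)
keys = [[1.0, 0.0], [0.98, 0.05], [0.0, 1.0], [0.05, 0.99]]
for token_id, key in enumerate(keys):
    cache.add(token_id, key)

selected = cache.retrieve([0.0, 1.0], top_k=1)
print(selected)  # [[2, 3]]
```

In this toy setup the query is compared against two centroids instead of four keys, and the selected tokens come back as one group. That is the kind of saving in management overhead and fragmented data movement the abstract attributes to moving from token-level to cluster-level units, although the actual system's clustering and offloading logic may differ substantially.
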
Problem

Research questions and friction points this paper is trying to address.

KVCache management
streaming long-video understanding
vision-language models
attention sparsity
memory overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-modal clustering
KVCache management
streaming video understanding
vision-language models
efficient inference