🤖 AI Summary
This work addresses a limitation of existing KV-cache compression methods for streaming video understanding: they rely on local token-wise heuristics and do not explicitly ensure that the retained cache is representative of the accumulated visual history. The paper formulates KV-cache compression as a coreset selection problem and proposes a bicriteria objective in a joint key-value representation that balances coverage in key space and in value space. An orthogonality-driven diversity criterion favors candidates that contribute new directions beyond the current selection, and the paper connects this criterion to log-determinant subset selection. Evaluated on four open-source vision-language models and five long-video and streaming-video benchmarks, the proposed method improves over heuristic streaming compression baselines under a fixed cache budget.
📝 Abstract
Streaming video understanding with large vision-language models (VLMs) requires a compact memory that can support future reasoning over an ever-growing visual history. A common solution is to compress the key-value (KV) cache, but existing streaming methods typically rely on local token-wise heuristics, such as recency, temporal redundancy, or saliency, which do not explicitly optimize whether the retained cache is representative of the accumulated history. We propose to view KV-cache compression as a coreset selection problem: rather than scoring tokens independently for retention, we select a small subset that covers the geometry of the accumulated visual cache. Our method operates in a joint KV representation and introduces a bicriteria objective that balances coverage in key and value spaces, preserving both retrieval structure and output-relevant information. To encourage a more diverse retained subset, we further introduce an orthogonality-driven diversity criterion that favors candidates contributing new directions beyond the current selection, and we connect this criterion to log-determinant subset selection. Across four open-source VLMs and five long-video and streaming-video benchmarks, our method improves over heuristic streaming compression baselines under a fixed cache budget. These results suggest that representative coreset selection is a more effective principle than token-wise pruning for memory-constrained streaming video understanding.
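To make the selection idea concrete, the following is a minimal sketch of orthogonality-driven subset selection over a joint key-value feature. It is not the paper's algorithm: the feature construction (a weighted concatenation of unit-norm key and value vectors), the weight `alpha`, the tolerance `tol`, and the function names `joint_features` and `select_coreset` are all assumptions made for illustration. The sketch repeatedly keeps the token whose feature has the largest component orthogonal to the span of the tokens already kept. This is the standard greedy heuristic for log-determinant subset selection, because adding token i to a set S multiplies the Gram determinant det(G_S) by the squared norm of that orthogonal residual.

```python
import math


def _normalize(vec):
    """Return vec scaled to unit Euclidean norm (unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def joint_features(keys, values, alpha=0.5):
    """Hypothetical joint KV feature: weighted concatenation of unit-norm key and value.

    alpha (assumed, not from the paper) trades off key-space against value-space geometry.
    """
    wk, wv = math.sqrt(alpha), math.sqrt(1.0 - alpha)
    return [
        [wk * x for x in _normalize(k)] + [wv * x for x in _normalize(v)]
        for k, v in zip(keys, values)
    ]


def select_coreset(keys, values, budget, alpha=0.5, tol=1e-12):
    """Greedily pick up to `budget` token indices that add new directions.

    Each step keeps the token with the largest residual energy, i.e. the squared
    norm of its feature after projecting out the span of the tokens kept so far.
    """
    feats = joint_features(keys, values, alpha)
    residuals = [list(f) for f in feats]
    selected = []
    for _ in range(min(budget, len(feats))):
        best, best_energy = -1, tol
        for i, r in enumerate(residuals):
            if i in selected:
                continue
            energy = sum(x * x for x in r)
            if energy > best_energy:
                best, best_energy = i, energy
        if best < 0:
            break  # every remaining token already lies in the selected span
        selected.append(best)
        scale = math.sqrt(best_energy)
        q = [x / scale for x in residuals[best]]  # new orthonormal direction
        for r in residuals:
            dot = sum(a * b for a, b in zip(r, q))
            for j in range(len(r)):
                r[j] -= dot * q[j]  # Gram-Schmidt step: remove the q component
    return selected


# Toy cache: tokens 0 and 1 are duplicates, token 2 is orthogonal to both.
toy_keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
kept = select_coreset(toy_keys, toy_values, budget=2)
```

In the toy cache, the duplicate token contributes no new direction once its twin is kept, so the second slot goes to the orthogonal token. A recency or saliency score applied to each token independently has no mechanism for noticing that kind of redundancy, which is the motivation the abstract gives for selecting a subset jointly.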