🤖 AI Summary
This work addresses the challenges of long-form video question answering, where the surge in visual tokens and limited context length of large language models lead to excessive memory consumption and loss of fine-grained spatiotemporal information. To overcome these issues, the authors propose MuKV, a novel approach that introduces a multi-granularity key-value (KV) cache structure for the first time. During offline processing, MuKV constructs hierarchical visual representations at the block, frame, and clip levels; during online inference, it employs a semi-hierarchical retrieval mechanism to efficiently access relevant cached features. Additionally, MuKV incorporates a dual-guided compression strategy that synergistically combines self-attention with frequency-domain signals to achieve precise and efficient token compression. Experiments demonstrate that MuKV significantly improves answer accuracy on long-form VideoQA benchmarks while reducing memory usage and maintaining real-time inference efficiency, outperforming all existing baselines across multiple metrics.
📝 Abstract
Long streaming video QA remains challenging due to growing visual tokens and limited reasoning length of large language models (LLMs). KV-caching stores the Key-Value (KV) of the historical tokens via LLM prefill and enables more efficient streaming QA. However, existing methods cache every one or two frames, causing redundant memory usage and losing fine-grained spatial details within frame or temporal contexts across frames. This paper proposes MuKV, a method that features a multi-grained KV cache compression module and a semi-hierarchical retrieval approach to improve both efficiency and accuracy for long streaming VideoQA. For the offline KV cache, MuKV extracts visual representations at patch-, frame-, and segment-levels. The multiple levels of granularity preserve both local cues and global temporal context, while maintaining efficiency with a dual signal token compression mechanism guided by self-attention and frequency. For online QA, MuKV designs a semi-hierarchical retrieval method to retrieve relevant KV caches for answer generation. Experiments on long-streaming VideoQA benchmarks show that MuKV significantly improves answer accuracy, without sacrificing memory and online QA efficiency. Moreover, our compression mechanism alone brings consistent benefits across answer accuracy, memory, and QA efficiency over baselines, showcasing highly effective contribution.