AccKV: Towards Efficient Audio-Video LLMs Inference via Adaptive-Focusing and Cross-Calibration KV Cache Optimization

📅 2025-11-14

📈 Citations: 0

✨ Influential: 0

career value

230K/year

🤖 AI Summary

To address KV cache redundancy, cross-modal attention misalignment, semantic interference, and degraded modality alignment arising from temporal expansion in audio-visual large language models (AV-LLMs), this paper proposes an adaptive focusing and cross-modal calibration framework for KV cache optimization. Our method dynamically selects salient modalities and tokens per layer via layer-wise adaptive focusing, and jointly applies attention redistribution, intra-modal integration, and cross-modal alignment to achieve fine-grained cache compression while preserving semantic consistency during inference. Experiments on mainstream AV-LLMs show an average 38.2% speedup in inference latency and a 41.7% reduction in GPU memory consumption, with negligible accuracy degradation (±0.3%). The core contribution is the first integration of dynamic modality-aware focusing with structured cross-modal calibration into KV cache management—effectively balancing computational efficiency and multimodal representational fidelity.

Technology Category

Application Category

📝 Abstract

Recent advancements in Audio-Video Large Language Models (AV-LLMs) have enhanced their capabilities in tasks like audio-visual question answering and multimodal dialog systems. Video and audio introduce an extended temporal dimension, resulting in a larger key-value (KV) cache compared to static image embedding. A naive optimization strategy is to selectively focus on and retain KV caches of audio or video based on task. However, in the experiment, we observed that the attention of AV-LLMs to various modalities in the high layers is not strictly dependent on the task. In higher layers, the attention of AV-LLMs shifts more towards the video modality. In addition, we also found that directly integrating temporal KV of audio and spatial-temporal KV of video may lead to information confusion and significant performance degradation of AV-LLMs. If audio and video are processed indiscriminately, it may also lead to excessive compression or reservation of a certain modality, thereby disrupting the alignment between modalities. To address these challenges, we propose AccKV, an Adaptive-Focusing and Cross-Calibration KV cache optimization framework designed specifically for efficient AV-LLMs inference. Our method is based on layer adaptive focusing technology, selectively focusing on key modalities according to the characteristics of different layers, and enhances the recognition of heavy hitter tokens through attention redistribution. In addition, we propose a Cross-Calibration technique that first integrates inefficient KV caches within the audio and video modalities, and then aligns low-priority modalities with high-priority modalities to selectively evict KV cache of low-priority modalities. The experimental results show that AccKV can significantly improve the computational efficiency of AV-LLMs while maintaining accuracy.

Problem

Research questions and friction points this paper is trying to address.

Optimizing KV cache for efficient audio-video LLMs inference

Addressing modality attention imbalance in higher network layers

Preventing information confusion between audio and video modalities

Innovation

Methods, ideas, or system contributions that make the work stand out.

Layer adaptive focusing on key modalities

Attention redistribution for heavy hitter tokens

Cross-calibration aligning low-priority with high-priority modalities

🔎 Similar Papers

Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference