AccKV: Towards Efficient Audio-Video LLMs Inference via Adaptive-Focusing and Cross-Calibration KV Cache Optimization

📅 2025-11-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address KV cache redundancy, cross-modal attention misalignment, semantic interference, and degraded modality alignment arising from temporal expansion in audio-visual large language models (AV-LLMs), this paper proposes an adaptive focusing and cross-modal calibration framework for KV cache optimization. Our method dynamically selects salient modalities and tokens per layer via layer-wise adaptive focusing, and jointly applies attention redistribution, intra-modal integration, and cross-modal alignment to achieve fine-grained cache compression while preserving semantic consistency during inference. Experiments on mainstream AV-LLMs show an average 38.2% speedup in inference latency and a 41.7% reduction in GPU memory consumption, with negligible accuracy degradation (±0.3%). The core contribution is the first integration of dynamic modality-aware focusing with structured cross-modal calibration into KV cache management—effectively balancing computational efficiency and multimodal representational fidelity.

Technology Category

Application Category

📝 Abstract
Recent advancements in Audio-Video Large Language Models (AV-LLMs) have enhanced their capabilities in tasks like audio-visual question answering and multimodal dialog systems. Video and audio introduce an extended temporal dimension, resulting in a larger key-value (KV) cache compared to static image embedding. A naive optimization strategy is to selectively focus on and retain KV caches of audio or video based on task. However, in the experiment, we observed that the attention of AV-LLMs to various modalities in the high layers is not strictly dependent on the task. In higher layers, the attention of AV-LLMs shifts more towards the video modality. In addition, we also found that directly integrating temporal KV of audio and spatial-temporal KV of video may lead to information confusion and significant performance degradation of AV-LLMs. If audio and video are processed indiscriminately, it may also lead to excessive compression or reservation of a certain modality, thereby disrupting the alignment between modalities. To address these challenges, we propose AccKV, an Adaptive-Focusing and Cross-Calibration KV cache optimization framework designed specifically for efficient AV-LLMs inference. Our method is based on layer adaptive focusing technology, selectively focusing on key modalities according to the characteristics of different layers, and enhances the recognition of heavy hitter tokens through attention redistribution. In addition, we propose a Cross-Calibration technique that first integrates inefficient KV caches within the audio and video modalities, and then aligns low-priority modalities with high-priority modalities to selectively evict KV cache of low-priority modalities. The experimental results show that AccKV can significantly improve the computational efficiency of AV-LLMs while maintaining accuracy.
Problem

Research questions and friction points this paper is trying to address.

Optimizing KV cache for efficient audio-video LLMs inference
Addressing modality attention imbalance in higher network layers
Preventing information confusion between audio and video modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Layer adaptive focusing on key modalities
Attention redistribution for heavy hitter tokens
Cross-calibration aligning low-priority with high-priority modalities
🔎 Similar Papers
No similar papers found.
Zhonghua Jiang
Zhonghua Jiang
Zhejiang University
Multimodal LLMEfficient AI3D GenerationFederated Learning
K
Kui Chen
Zhejiang University
K
Kunxi Li
Zhejiang University
K
Keting Yin
Zhejiang University
Yiyun Zhou
Yiyun Zhou
Zhejiang University
Data MiningMultimodal LearningLarge Language Model
Zhaode Wang
Zhaode Wang
Alibaba
C
Chengfei Lv
Alibaba Group
S
Shengyu Zhang
Zhejiang University