Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs

📅 2026-05-12

🤖 AI Summary
This work addresses the high computational overhead of Omni-LLMs in processing multimodal inputs and the tendency of existing pruning methods to discard context that falls outside query relevance or cross-modal alignment. The authors propose ContextGuard, an inference-time framework that shifts the pruning objective from query relevance to preserving broad audio-visual context while removing cross-modal redundancy. A lightweight audio-to-visual semantic predictor identifies video tokens whose coarse semantics are likely recoverable from audio; these are pruned, while additional video tokens are retained to preserve localized visual details that audio alone cannot specify. Temporally similar video tokens are then merged for further compression. ContextGuard requires no fine-tuning of the downstream LLM and uses only an independently trained predictor. Across six audio-visual benchmarks on Qwen2.5-Omni and Video-SALMONN2+ (3B and 7B), it outperforms prior inference-time pruning methods while pruning more tokens. On Qwen2.5-Omni 7B, it reaches full-token-level performance on five of the six benchmarks while pruning 55% of input tokens.
📝 Abstract
Omnimodal Large Language Models (Omni-LLMs) incur substantial computational overhead due to the large number of multimodal input tokens they process, making token reduction essential for real-world deployment. Existing Omni-LLM pruning methods typically reduce this cost by selecting tokens that are important for the current query or strongly aligned with cross-modal cues. However, such strategies can discard evidence that falls outside these criteria, even when needed for different questions or for understanding context beyond aligned audio-visual cues. To address this limitation, we reframe Omni-LLM token reduction as preserving broad audio-visual context while removing cross-modal redundancy. We propose ContextGuard, an inference-time token pruning framework built on this principle. ContextGuard predicts coarse visual semantics from audio and prunes video tokens whose coarse semantics are likely recoverable from audio, while retaining additional video tokens to preserve localized visual details that audio alone cannot specify. For further compression, our method merges temporally similar video tokens. The framework requires no downstream LLM fine-tuning and uses only an independently trained lightweight predictor. On Qwen2.5-Omni and Video-SALMONN2+ at 3B and 7B scales across six audio-visual benchmarks, ContextGuard outperforms prior inference-time pruning methods while pruning more tokens. Notably, on Qwen2.5-Omni 7B, ContextGuard achieves full-token-level performance on five of six benchmarks while pruning 55% of input tokens.
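The two steps the abstract describes (pruning video tokens that audio can likely recover, then merging temporally similar tokens) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the predictor, the scoring rule, or the merging criterion. The sketch assumes the predictor has already produced one coarse embedding per frame from audio, scores recoverability with cosine similarity, keeps a fixed fraction of tokens per frame, and merges consecutive tokens above a similarity threshold. The names `select_tokens_to_keep`, `merge_temporally_similar`, `keep_ratio`, and `threshold` are hypothetical.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine similarity along the last axis, with broadcasting."""
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
    return np.sum(a * b, axis=-1)


def select_tokens_to_keep(video_tokens, predicted_from_audio, keep_ratio):
    """Return, per frame, the indices of video tokens to keep.

    video_tokens:         (T, N, D) array, N tokens for each of T frames.
    predicted_from_audio: (T, D) array, coarse per-frame visual semantics
                          predicted from audio (assumed predictor output).
    keep_ratio:           fraction of tokens kept per frame (assumed knob).

    Tokens most similar to the audio-based prediction are treated as
    recoverable from audio and dropped; the least similar ones are kept.
    """
    _, n_tokens, _ = video_tokens.shape
    k = max(1, int(round(n_tokens * keep_ratio)))
    sim = cosine_sim(video_tokens, predicted_from_audio[:, None, :])  # (T, N)
    lowest = np.argsort(sim, axis=1)[:, :k]  # least recoverable first
    return np.sort(lowest, axis=1)           # restore spatial order


def merge_temporally_similar(tokens, threshold):
    """Merge runs of consecutive similar tokens into their mean.

    tokens:    (T, D) array, one token per time step.
    threshold: cosine similarity at or above which a token joins the
               current run (assumed criterion).
    """
    merged = [tokens[0].astype(float)]
    counts = [1]
    for tok in tokens[1:]:
        if cosine_sim(merged[-1], tok) >= threshold:
            # Update the running mean of the current run.
            merged[-1] = (merged[-1] * counts[-1] + tok) / (counts[-1] + 1)
            counts[-1] += 1
        else:
            merged.append(tok.astype(float))
            counts.append(1)
    return np.stack(merged)
```

The kept tokens can be gathered with `np.take_along_axis(video_tokens, keep_idx[:, :, None], axis=1)`. A fixed per-frame `keep_ratio` is only one simple way to retain localized detail; the abstract does not say how ContextGuard decides how many additional tokens to retain.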
Problem

Research questions and friction points this paper is trying to address.

Omni-LLMs
token pruning
multimodal context
audio-visual redundancy
computational overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

token pruning
multimodal context preservation
audio-visual redundancy
inference-time compression
Omni-LLMs