🤖 AI Summary
This work addresses the high computational overhead Omni-LLMs incur when processing multimodal inputs, and the tendency of existing pruning methods to discard evidence that is not tied to the current query or to strongly aligned audio-visual cues. The authors propose ContextGuard, an inference-time framework that shifts the pruning objective from query relevance and cross-modal alignment to preserving broad audio-visual context while removing cross-modal redundancy. Specifically, a lightweight audio-to-visual semantic predictor identifies video tokens whose coarse semantics are likely recoverable from audio; these are pruned, while additional video tokens are retained to preserve localized visual details that audio alone cannot specify. Temporally similar video tokens are then merged for further compression. ContextGuard requires no fine-tuning of the downstream LLM and relies only on an independently trained predictor. Across six audio-visual benchmarks on Qwen2.5-Omni and Video-SALMONN2+ (3B/7B), it outperforms prior inference-time pruning methods while pruning more tokens. On Qwen2.5-Omni 7B, it reaches full-token-level performance on five of the six benchmarks while pruning 55% of input tokens.
📝 Abstract
Omnimodal Large Language Models (Omni-LLMs) incur substantial computational overhead due to the large number of multimodal input tokens they process, making token reduction essential for real-world deployment. Existing Omni-LLM pruning methods typically reduce this cost by selecting tokens that are important for the current query or strongly aligned with cross-modal cues. However, such strategies can discard evidence that falls outside these criteria, even when needed for different questions or for understanding context beyond aligned audio-visual cues. To address this limitation, we reframe Omni-LLM token reduction as preserving broad audio-visual context while removing cross-modal redundancy. We propose ContextGuard, an inference-time token pruning framework built on this principle. ContextGuard predicts coarse visual semantics from audio and prunes video tokens whose coarse semantics are likely recoverable from audio, while retaining additional video tokens to preserve localized visual details that audio alone cannot specify. For further compression, our method merges temporally similar video tokens. The framework requires no downstream LLM fine-tuning and uses only an independently trained lightweight predictor. On Qwen2.5-Omni and Video-SALMONN2+ at 3B and 7B scales across six audio-visual benchmarks, ContextGuard outperforms prior inference-time pruning methods while pruning more tokens. Notably, on Qwen2.5-Omni 7B, ContextGuard achieves full-token-level performance on five of six benchmarks while pruning 55% of input tokens.
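The prune-then-merge idea described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the function names (`prune_recoverable`, `merge_temporal`), the use of cosine similarity, the single audio-predicted vector, and the `keep_ratio` and `threshold` parameters are all assumptions made for illustration. The sketch also collapses the two retention criteria into one rule (keep the tokens least similar to what audio predicts) and treats the video as a flat temporal sequence of token vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def prune_recoverable(video_tokens, audio_predicted, keep_ratio):
    """Return indices of video tokens to keep, in temporal order.

    Assumption: `audio_predicted` is one coarse visual-semantic vector
    produced by a hypothetical audio-to-visual predictor. Tokens most
    similar to it are treated as recoverable from audio and dropped.
    """
    if not video_tokens:
        return []
    scores = [cosine(v, audio_predicted) for v in video_tokens]
    n_keep = max(1, math.ceil(keep_ratio * len(video_tokens)))
    # Lowest similarity first: these are hardest to recover from audio.
    order = sorted(range(len(video_tokens)), key=lambda i: scores[i])
    return sorted(order[:n_keep])


def merge_temporal(tokens, threshold):
    """Average runs of consecutive tokens whose similarity is >= threshold."""
    if not tokens:
        return []
    merged, group = [], [tokens[0]]
    for tok in tokens[1:]:
        if cosine(group[-1], tok) >= threshold:
            group.append(tok)
        else:
            merged.append([sum(col) / len(group) for col in zip(*group)])
            group = [tok]
    merged.append([sum(col) / len(group) for col in zip(*group)])
    return merged


# Toy example: the first two tokens resemble the audio-predicted semantics.
video = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.0, 1.0]]
kept = prune_recoverable(video, audio_predicted=[1.0, 0.0], keep_ratio=0.5)
compressed = merge_temporal([video[i] for i in kept], threshold=0.95)
```

In this toy run, the two tokens that differ from the audio-predicted vector survive pruning and are then merged into one because they are identical. A real system would operate on the model's actual video token embeddings and would need a separate mechanism for retaining localized visual detail, which this sketch does not model.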