OmniSelect: Dynamic Modality-Aware Token Compression for Efficient Omni-modal Large Language Models

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing audio-visual multimodal large language models commonly use fixed, modality-specific compression strategies, which struggle to balance efficiency and performance and often cause either critical information loss or computational redundancy. This work proposes OmniSelect, a training-free, dynamic, modality-aware token compression framework that the authors present as the first to use query-driven multimodal token pruning. OmniSelect uses a lightweight AudioCLIP model to assess cross-modal relevance and dynamically categorizes each input as audio-dominant, video-dominant, or balanced. It then allocates pruning ratios at a fine-grained level within temporal groups, explicitly modeling modality preference. Experiments demonstrate that OmniSelect substantially reduces token count while effectively preserving downstream task performance.
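The regime-selection step can be illustrated with a minimal sketch. It assumes the relevance model has already produced one query-to-audio and one query-to-video relevance score; the function name `select_regime`, the `margin` threshold, and the gap-based decision rule are hypothetical choices for illustration and are not taken from the paper.

```python
from enum import Enum


class Regime(Enum):
    AUDIO_CENTRIC = "audio-centric"
    VIDEO_CENTRIC = "video-centric"
    UNIFORM = "uniform"


def select_regime(audio_relevance: float, video_relevance: float,
                  margin: float = 0.1) -> Regime:
    """Pick a pruning regime from two query relevance scores.

    `margin` is an assumed threshold: if neither modality leads by more
    than this gap, the input is treated as balanced.
    """
    gap = audio_relevance - video_relevance
    if gap > margin:
        return Regime.AUDIO_CENTRIC
    if gap < -margin:
        return Regime.VIDEO_CENTRIC
    return Regime.UNIFORM


if __name__ == "__main__":
    print(select_regime(0.8, 0.3))  # audio clearly leads
    print(select_regime(0.2, 0.9))  # video clearly leads
    print(select_regime(0.5, 0.5))  # no clear preference
```

A simple gap threshold keeps the rule easy to inspect; the actual criterion OmniSelect applies to its relevance scores may differ.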

📝 Abstract

Omnimodal large language models (OmniLLMs) have recently gained increasing attention for unified audio-video understanding. However, processing long multimodal token sequences introduces substantial computational overhead, making efficient token compression crucial. Existing methods typically rely on fixed, modality-specific guidance, which fails to account for the varying importance of modalities across different queries. To address this limitation, we propose OmniSelect, a training-free, modality-adaptive token pruning framework that dynamically selects appropriate compression strategies for multimodal inputs. Specifically, we leverage a lightweight AudioCLIP model to estimate cross-modal relevance and assign each input to one of three pruning regimes: Audio-Centric, Video-Centric, or Uniform pruning. Based on these relevance scores, OmniSelect further performs fine-grained token pruning within each temporal group, adaptively allocating pruning ratios to preserve informative tokens across modalities. By explicitly modeling modality preference and enabling dynamic strategy selection, OmniSelect effectively avoids the pitfalls of one-size-fits-all compression. Extensive experiments demonstrate that our method achieves efficient multimodal token reduction while maintaining strong performance, without requiring any additional training.
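As a rough illustration of fine-grained pruning within one temporal group, the sketch below splits a group's token budget between audio and video and keeps the highest-scoring tokens of each modality. It is written under stated assumptions: per-token importance scores and per-modality relevance scores are given as inputs, and the proportional budget split, the function names `top_k_indices` and `prune_group`, and the `keep_ratio` parameter are all hypothetical rather than the paper's actual allocation rule.

```python
def top_k_indices(scores, k):
    """Indices of the k highest scores, returned in original temporal order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def prune_group(audio_scores, video_scores, audio_rel, video_rel, keep_ratio):
    """Select the tokens to keep in a single temporal group.

    audio_scores / video_scores: assumed per-token importance scores.
    audio_rel / video_rel: assumed query relevance of each modality.
    keep_ratio: fraction of the group's tokens to keep overall.
    """
    n_a, n_v = len(audio_scores), len(video_scores)
    budget = round(keep_ratio * (n_a + n_v))

    # Weight each modality by relevance times token count, so a more
    # relevant modality receives a larger share of the budget.
    w_a, w_v = audio_rel * n_a, video_rel * n_v
    share_a = w_a / (w_a + w_v) if (w_a + w_v) > 0 else 0.5

    # Cap each share at the number of tokens actually available.
    k_a = min(n_a, round(budget * share_a))
    k_v = min(n_v, budget - k_a)
    return top_k_indices(audio_scores, k_a), top_k_indices(video_scores, k_v)


if __name__ == "__main__":
    audio = [0.9, 0.1, 0.5, 0.3]
    video = [0.2, 0.8, 0.4, 0.6, 0.1, 0.7, 0.3, 0.05]
    # Audio-leaning query: most of the budget goes to audio tokens.
    print(prune_group(audio, video, audio_rel=0.6, video_rel=0.2, keep_ratio=0.5))
    # Video-leaning query: the same budget shifts toward video tokens.
    print(prune_group(audio, video, audio_rel=0.2, video_rel=0.6, keep_ratio=0.5))
```

The overall budget stays the same in both calls; only its split across modalities changes with the relevance scores, which is the behavior the abstract describes as adaptive allocation of pruning ratios.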

Problem

Research questions and friction points this paper is trying to address.

token compression

omni-modal large language models

modality-aware

multimodal understanding

dynamic pruning

Innovation

Methods, ideas, or system contributions that make the work stand out.

token compression

modality-aware pruning

omnimodal LLMs