OmniSelect: Dynamic Modality-Aware Token Compression for Efficient Omni-modal Large Language Models

📅 2026-05-18
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Existing audio-visual multimodal large language models commonly employ fixed modality compression strategies. These struggle to balance efficiency and performance, and often cause either critical information loss or computational redundancy. This work proposes OmniSelect, a training-free, dynamic, modality-aware token compression framework that is presented as the first to use query-driven multimodal token pruning. OmniSelect uses a lightweight AudioCLIP model to assess cross-modal relevance and dynamically categorizes each input as audio-dominant, video-dominant, or balanced. It then allocates pruning ratios at a fine-grained level within temporal groups, explicitly modeling modality preference. Experiments demonstrate that OmniSelect substantially reduces token count while effectively preserving downstream task performance.
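The categorization step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the query, audio, and video have already been embedded into a shared space (for example by an AudioCLIP-style encoder), that relevance is measured by cosine similarity, and that a regime is chosen by comparing the two relevance scores against a margin. The names `cosine` and `select_regime` and the `margin` value are hypothetical.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_regime(query_emb, audio_emb, video_emb, margin=0.1):
    """Pick a pruning regime from query-to-modality relevance.

    Assumption: the embeddings come from a shared audio/video/text space.
    The margin is an illustrative threshold, not a value from the paper.
    """
    audio_rel = cosine(query_emb, audio_emb)
    video_rel = cosine(query_emb, video_emb)
    if audio_rel - video_rel > margin:
        return "audio-centric"   # the query depends mostly on the audio stream
    if video_rel - audio_rel > margin:
        return "video-centric"   # the query depends mostly on the video stream
    return "uniform"             # neither modality clearly dominates
```

With toy two-dimensional embeddings, a query aligned with the audio vector falls into the audio-centric regime, while a query equally similar to both modalities falls into the uniform regime.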
πŸ“ Abstract
Omnimodal large language models (OmniLLMs) have recently gained increasing attention for unified audio-video understanding. However, processing long multimodal token sequences introduces substantial computational overhead, making efficient token compression crucial. Existing methods typically rely on fixed, modality-specific guidance, which fails to account for the varying importance of modalities across different queries. To address this limitation, we propose $\textbf{OmniSelect}$, a training-free, modality-adaptive token pruning framework that dynamically selects appropriate compression strategies for multimodal inputs. Specifically, we leverage a lightweight AudioCLIP model to estimate cross-modal relevance and categorize each input into three pruning regimes: Audio-Centric, Video-Centric, and Uniform pruning. Based on these relevance scores, OmniSelect further performs fine-grained token pruning within each temporal group, adaptively allocating pruning ratios to preserve informative tokens across modalities. By explicitly modeling modality preference and enabling dynamic strategy selection, OmniSelect effectively avoids the pitfalls of one-size-fits-all compression. Extensive experiments demonstrate that our method achieves efficient multimodal token reduction while maintaining strong performance, without requiring any additional training.
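The fine-grained step, allocating pruning ratios inside a temporal group, can be sketched as follows. This is an illustration under stated assumptions rather than the method from the paper: it assumes each token already has an importance score, splits an overall keep ratio between audio and video in proportion to their relevance, and keeps the top-scoring tokens of each modality in their original order. All function names and the proportional allocation rule are hypothetical.

```python
def allocate_keep_ratios(audio_rel, video_rel, keep_ratio):
    """Split an overall keep ratio between audio and video by relevance.

    Assumption: allocation is proportional to (non-negative) relevance.
    The two ratios average to keep_ratio unless one is clamped at 1.0.
    """
    audio_rel = max(audio_rel, 0.0)
    video_rel = max(video_rel, 0.0)
    total = audio_rel + video_rel
    if total == 0.0:
        return keep_ratio, keep_ratio  # no preference: prune both equally
    audio_share = audio_rel / total
    audio_keep = min(1.0, 2.0 * keep_ratio * audio_share)
    video_keep = min(1.0, 2.0 * keep_ratio * (1.0 - audio_share))
    return audio_keep, video_keep


def prune_group(scores, keep):
    """Return indices of the tokens to keep: top-k by score, in original order."""
    if not scores:
        return []
    k = min(len(scores), max(1, round(keep * len(scores))))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def prune_temporal_group(audio_scores, video_scores, audio_rel, video_rel, keep_ratio):
    """Prune the audio and video tokens of one temporal group."""
    audio_keep, video_keep = allocate_keep_ratios(audio_rel, video_rel, keep_ratio)
    return prune_group(audio_scores, audio_keep), prune_group(video_scores, video_keep)
```

Keeping at least one token per modality in each group is a design choice of this sketch, made so that no modality disappears entirely from a temporal window. Whether OmniSelect enforces such a floor is not stated in the abstract.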
Problem

Research questions and friction points this paper is trying to address.

token compression
omni-modal large language models
modality-aware
multimodal understanding
dynamic pruning
Innovation

Methods, ideas, or system contributions that make the work stand out.

token compression
modality-aware pruning
omnimodal LLMs
dynamic strategy selection
training-free