🤖 AI Summary
This work addresses the substantial computational overhead of Omni-LLMs when processing long multimodal token sequences, a challenge that existing compression methods mitigate only partially because they struggle to balance efficiency and performance. To this end, we propose OmniSIFT, the first modality-asymmetric token compression framework tailored for Omni-LLMs. OmniSIFT reduces redundancy through spatio-temporal video pruning, which eliminates both intra- and inter-frame redundancy, and employs a vision-guided audio selection module to filter out irrelevant audio tokens. End-to-end optimization is enabled via a differentiable straight-through estimator. Remarkably, with only 25% of the original tokens retained, OmniSIFT outperforms all compression baselines across five benchmarks and even surpasses the full-token model on certain tasks, while introducing merely 4.85M additional parameters and achieving lower inference latency than training-free baselines such as OmniZip.
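To make the inter-frame pruning idea concrete, here is a minimal pure-Python sketch. It is our own illustration under stated assumptions, not OmniSIFT's actual module: we assume a token is redundant when it is nearly identical, under cosine similarity, to the same-position token in the previous frame; the threshold value and the per-position matching scheme are hypothetical.

```python
import math

def cosine(u, v):
    # Cosine similarity between two token embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def prune_interframe(frames, sim_threshold=0.95):
    """Return, per frame, the indices of tokens to keep.

    frames: list of frames; each frame is a list of token vectors.
    All tokens of the first frame are kept; in later frames, a token is
    dropped if it is nearly identical to the same-position token in the
    previous frame (a simplifying assumption for illustration).
    """
    kept = [list(range(len(frames[0])))]
    for t in range(1, len(frames)):
        keep_t = [i for i, tok in enumerate(frames[t])
                  if cosine(tok, frames[t - 1][i]) < sim_threshold]
        kept.append(keep_t)
    return kept
```

For example, if frame 1 repeats frame 0's first token exactly but changes the second, only the changed token survives in frame 1.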
📝 Abstract
Omni-modal Large Language Models (Omni-LLMs) have demonstrated strong capabilities in audio-video understanding tasks. However, their reliance on long multimodal token sequences leads to substantial computational overhead. Despite this challenge, token compression methods designed for Omni-LLMs remain limited. To bridge this gap, we propose OmniSIFT (Omni-modal Spatio-temporal Informed Fine-grained Token compression), a modality-asymmetric token compression framework tailored for Omni-LLMs. Specifically, OmniSIFT adopts a two-stage compression strategy: (i) a spatio-temporal video pruning module that removes video redundancy arising from both intra-frame structure and inter-frame overlap, and (ii) a vision-guided audio selection module that filters out audio tokens irrelevant to the visual content. The entire framework is optimized end-to-end via a differentiable straight-through estimator. Extensive experiments on five representative benchmarks demonstrate the efficacy and robustness of OmniSIFT. Notably, for Qwen2.5-Omni-7B, OmniSIFT introduces only 4.85M parameters while maintaining lower latency than training-free baselines such as OmniZip. With merely 25% of the original token context, OmniSIFT consistently outperforms all compression baselines and even surpasses the performance of the full-token model on several tasks.
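The straight-through estimator that makes hard token selection trainable can be sketched as follows. This is a minimal, hypothetical illustration, not OmniSIFT's actual selection module: the forward pass makes a hard binary keep/drop decision per token score, while the backward pass treats the hard gate as the identity, so gradients still flow to the otherwise non-differentiable selection scores.

```python
class StraightThroughGate:
    """Sketch of a straight-through estimator for hard token selection.

    Forward: binarize per-token importance scores into a keep/drop mask.
    Backward: pretend the hard thresholding was the identity function, so
    upstream gradients pass through to the scores unchanged (the STE trick).
    The threshold value is an assumption for illustration.
    """

    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def forward(self, scores):
        # Hard, non-differentiable keep/drop decision per token.
        return [1.0 if s >= self.threshold else 0.0 for s in scores]

    def backward(self, grad_output):
        # STE: treat d(mask)/d(score) as 1, so the gradient w.r.t. the
        # scores is just the gradient w.r.t. the mask, copied through.
        return list(grad_output)
```

In an autograd framework, the same effect is commonly obtained with an expression like `hard_mask + (soft_scores - soft_scores.detach())`, which uses the hard mask in the forward pass but the soft scores' gradient in the backward pass.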