🤖 AI Summary
Existing compression methods often disrupt cross-modal alignment when processing long videos with dense audio, struggling to balance efficiency and performance. This work proposes OmniRefine, a training-free, two-stage audio-visual token compression framework. It first constructs adaptive compression units aligned across modalities by leveraging frame–audio similarity and dynamic programming, then performs modality-aware collaborative token pruning within each unit to preserve critical complementary information. OmniRefine is the first approach to achieve alignment-aware joint compression, overcoming the limitations of fixed-size compression units. On the WorldSense dataset, it attains 46.7% accuracy using only 44% of the original tokens—nearly matching the full-token baseline—and significantly outperforms existing methods.
📝 Abstract
Omnimodal large language models (Omni-LLMs) show strong capability in audio-video understanding, but their practical deployment remains limited by high inference cost of long video streams and dense audio sequences. Despite recent progress, existing compression methods for Omni-LLMs typically rely on fixed or native compression units, which can disrupt cross-modal correspondence and the complementary information required for audio-video reasoning, making it difficult to improve inference efficiency while stably preserving performance. To address this, we propose OmniRefine, a training-free two-stage framework for efficient audio-visual token compression in Omni-LLMs. First, Correspondence-Preserving Chunk Refinement refines native chunk boundaries into cross-modally aligned compression units through frame-audio similarity and dynamic programming. Second, Modality-Aware Cooperative Compression jointly compresses video and audio tokens within each refined unit to reduce redundancy while preserving critical evidence. Extensive experiments show that OmniRefine achieves a better efficiency-performance trade-off than strong baselines and maintains stable performance under lower compression ratios. On WorldSense, it still reaches 46.7% accuracy at a 44% token retention ratio, nearly matching the full-token baseline. The code and interface will be released to facilitate further research.