🤖 AI Summary
AV-LLMs suffer from high computational overhead due to redundant audio and video tokens; existing unimodal token-pruning methods fail to capture cross-modal semantic synergy, and fixed per-modality budgets cannot adapt to the dynamic, heterogeneous information density of audio-video streams. This paper proposes EchoingPixels, the first adaptive token compression framework for joint audio-visual sequences, built on two components: (1) the Cross-Modal Semantic Sieve (CS2), which enables early audio-visual interaction and dynamic pruning over a unified token pool; and (2) Synchronization-Augmented RoPE (Sync-RoPE), which preserves temporal modeling fidelity for the sparsely retained tokens. Experiments demonstrate that the method matches baseline performance while retaining only 5–20% of the original tokens, achieves a 2–3× inference speedup, and reduces GPU memory consumption, significantly advancing efficient cross-modal token compression.
📝 Abstract
Audio-Visual Large Language Models (AV-LLMs) face prohibitive computational overhead from massive numbers of audio and video tokens. Token reduction, while extensively explored for video-only LLMs, is insufficient for the audio-visual domain, as these unimodal methods cannot leverage audio-visual cross-modal synergies. Furthermore, the distinct and dynamic information densities of audio and video render static per-modality budgets suboptimal. How to perform token reduction on a joint audio-visual stream thus remains an unaddressed bottleneck. To fill this gap, we introduce EchoingPixels, a framework inspired by the coexistence and interaction of visuals and sound in real-world scenes. The core of our framework is the Cross-Modal Semantic Sieve (CS2), a module enabling early audio-visual interaction. Instead of compressing each modality independently, CS2 co-attends to the joint multimodal stream and reduces tokens from a single combined pool of audio-visual tokens rather than applying fixed budgets per modality. This single-pool approach lets it adaptively allocate the token budget across both modalities and identify salient tokens in concert. To ensure this aggressive reduction preserves the vital temporal modeling capability, we co-design a Synchronization-Augmented RoPE (Sync-RoPE) that maintains critical temporal relationships for the sparsely selected tokens. Extensive experiments demonstrate that EchoingPixels achieves performance comparable to strong baselines using only 5–20% of the original tokens, with a 2–3× speedup and reduced memory consumption.
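To make the single-pool idea concrete, here is a minimal sketch of joint audio-visual token pruning. All names and the API are illustrative assumptions, not the paper's implementation: tokens from both modalities compete for one shared budget based on saliency scores (which CS2 would derive from cross-modal attention), and the surviving tokens keep their original timeline indices so that positional encoding can still reflect pre-pruning temporal order, in the spirit of Sync-RoPE.

```python
def joint_topk_prune(audio_tokens, video_tokens,
                     audio_scores, video_scores, keep_ratio=0.1):
    """Illustrative joint-pool pruning (hypothetical API, not the paper's code).

    Audio and video tokens are ranked together by saliency, and only the
    top fraction is kept, so the budget split between modalities emerges
    from the scores instead of being fixed per modality.
    """
    tokens = audio_tokens + video_tokens   # one combined audio-visual pool
    scores = audio_scores + video_scores
    k = max(1, int(keep_ratio * len(scores)))
    # Rank indices by score, keep the top-k, then restore original order
    # so each surviving token retains its pre-pruning timeline position
    # (these original indices are what a Sync-RoPE-style scheme would
    # feed to the positional encoding).
    keep = sorted(sorted(range(len(scores)), key=lambda i: scores[i])[-k:])
    kept_tokens = [tokens[i] for i in keep]
    kept_positions = keep                  # original indices for RoPE
    return kept_tokens, kept_positions


# Usage: with a 50% budget, high-scoring tokens survive regardless of
# which modality they came from, and positions stay in original order.
toks, pos = joint_topk_prune(
    ["a0", "a1", "a2"], ["v0", "v1", "v2", "v3"],
    [0.9, 0.1, 0.2], [0.8, 0.05, 0.7, 0.3], keep_ratio=0.5)
```

Note that the budget here is a single `keep_ratio` over the combined pool: if one clip is visually static but acoustically rich, more audio tokens survive, and vice versa, which is the adaptive allocation the abstract describes.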