AI Summary
This work addresses the high computational and memory costs that long audio prefixes impose on audio-language models at inference time. Existing training-free compression methods do not solve the problem well: they struggle to preserve both content relevance and local acoustic context. The authors propose Local Temporal Bipartite Merging (LTBM), a training-free method that operates in encoder space and merges similar nearby audio tokens under an explicit temporal window constraint, so that compression is both content-aware and local. To test whether temporal locality itself is a useful inductive bias for audio-token compression, they compare LTBM against a controlled Global Merge variant that drops the locality constraint. Experiments on AudioCaps, Clotho, and MMAU indicate that the best strategy depends on the task: locality-aware merging is more favorable for audio captioning, especially under strong compression, whereas global matching is more competitive for multiple-choice audio understanding.
Abstract
Audio-language models (ALMs) are increasingly used for audio captioning, question answering, and open-ended audio understanding, but their inference cost remains high when audio inputs are represented as long prefix-token sequences. These audio prefixes consume context budget, increase memory usage, and make deployment harder in resource-constrained or latency-sensitive settings. Existing training-free audio-token reduction methods mainly rely on fixed pooling or score-based pruning. Fixed pooling is content-agnostic, while score-based pruning can preserve isolated salient tokens but discard nearby acoustic context. We propose Local Temporal Bipartite Merging (LTBM), a training-free encoder-space compression method that merges similar nearby audio tokens under an explicit temporal window constraint. Beyond introducing LTBM, we use a controlled Global Merge variant to isolate whether temporal locality itself is a useful inductive bias for audio-token compression. Experiments on AudioCaps, Clotho, and MMAU with Qwen2-Audio show evidence of a task-dependent locality effect: locality-aware merging is more favorable for captioning at several compression settings, especially under stronger compression, while global matching is more competitive for multiple-choice audio understanding. A cross-backbone validation on Audio Flamingo 3 further supports the captioning-side advantage of locality-aware merging under moderate and aggressive compression.
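The merging idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify how tokens are partitioned, how the window is defined, or how merged tokens are combined. The sketch assumes an alternating even/odd split in the style of bipartite soft matching, cosine similarity, and unweighted averaging. The function name `bipartite_merge` and the parameters `r` and `window` are hypothetical. Setting `window=None` removes the locality constraint, which loosely mirrors the role of the Global Merge variant.

```python
import numpy as np

def bipartite_merge(tokens, r, window=None):
    """Merge up to r tokens from set A (even positions) into set B (odd positions).

    Hypothetical sketch: window=None allows any A-B pair (global matching);
    an integer window only allows pairs whose positions differ by <= window.
    """
    n, _ = tokens.shape
    a_idx = np.arange(0, n, 2)   # candidate source tokens
    b_idx = np.arange(1, n, 2)   # candidate destination tokens
    r = min(r, len(a_idx))
    if r <= 0 or len(b_idx) == 0:
        return tokens.astype(float).copy()

    # Cosine similarity between every A token and every B token.
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = unit[a_idx] @ unit[b_idx].T

    if window is not None:
        # Temporal locality: mask out pairs that are too far apart in time.
        dist = np.abs(a_idx[:, None] - b_idx[None, :])
        sim = np.where(dist <= window, sim, -np.inf)

    best_b = sim.argmax(axis=1)    # best partner in B for each A token
    best_sim = sim.max(axis=1)
    # Pick the r A tokens whose best partner is most similar.
    chosen = np.argsort(-best_sim, kind="stable")[:r]

    sums = tokens.astype(float).copy()
    counts = np.ones(n)
    keep = np.ones(n, dtype=bool)
    for a in chosen:
        if not np.isfinite(best_sim[a]):
            continue               # no partner allowed inside the window
        src, dst = a_idx[a], b_idx[best_b[a]]
        sums[dst] += tokens[src]
        counts[dst] += 1
        keep[src] = False

    # Average each merged group and keep the original temporal order.
    return (sums / counts[:, None])[keep]

# Toy input: tokens 0 and 3 are identical but far apart in time.
demo = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0],
                 [1.0, 0.0, 0.0]])
global_out = bipartite_merge(demo, r=1)            # may pair token 0 with token 3
local_out = bipartite_merge(demo, r=1, window=1)   # token 0 can only pair with token 1
```

In this toy input the global variant merges the two identical but distant tokens, while the windowed variant is forced to merge token 0 with its immediate neighbor. That is the trade-off the abstract points to: global matching follows content similarity wherever it occurs, while local matching preserves the temporal neighborhood.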