Locality Matters for Training-Free Audio Token Compression in Audio-Language Models

📅 2026-05-24
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the high computational and memory cost of long audio prefixes during inference in audio-language models. Existing training-free compression methods struggle to preserve both content relevance and local acoustic context. The authors propose Local Temporal Bipartite Merging (LTBM), which imposes an explicit temporal window constraint in encoder space and merges similar neighboring tokens, so that compression is both content-aware and local. A controlled Global Merge variant isolates the effect of locality and shows that the preferred compression strategy is task-dependent: with Qwen2-Audio, locality-aware merging is more favorable for audio captioning on AudioCaps and Clotho, especially under strong compression, whereas global matching is more competitive for multiple-choice audio understanding on MMAU. A cross-backbone check on Audio Flamingo 3 further supports the captioning-side advantage.
📝 Abstract
Audio-language models (ALMs) are increasingly used for audio captioning, question answering, and open-ended audio understanding, but their inference cost remains high when audio inputs are represented as long prefix-token sequences. These audio prefixes consume context budget, increase memory usage, and make deployment harder in resource-constrained or latency-sensitive settings. Existing training-free audio-token reduction methods mainly rely on fixed pooling or score-based pruning. Fixed pooling is content-agnostic, while score-based pruning can preserve isolated salient tokens but discard nearby acoustic context. We propose Local Temporal Bipartite Merging (LTBM), a training-free encoder-space compression method that merges similar nearby audio tokens under an explicit temporal window constraint. Beyond introducing LTBM, we use a controlled Global Merge variant to isolate whether temporal locality itself is a useful inductive bias for audio-token compression. Experiments on AudioCaps, Clotho, and MMAU with Qwen2-Audio show evidence of a task-dependent locality effect: locality-aware merging is more favorable for captioning at several compression settings, especially under stronger compression, while global matching is more competitive for multiple-choice audio understanding. A cross-backbone validation on Audio Flamingo 3 further supports the captioning-side advantage of locality-aware merging under moderate and aggressive compression.
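The abstract describes merging similar nearby tokens under a temporal window, with a global variant as a control. The sketch below illustrates that idea only and is not the authors' implementation. Several details are assumptions the abstract does not specify: the even/odd split into merge sources and targets, cosine similarity as the matching score, plain averaging of merged tokens, and the hypothetical names `windowed_bipartite_merge`, `r`, and `window`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def windowed_bipartite_merge(tokens, r, window=None):
    """Merge up to r tokens into a similar partner at most `window` steps away.

    tokens: list of feature vectors in temporal order.
    r:      maximum number of tokens to remove by merging.
    window: maximum index distance allowed for a match; None drops the
            constraint, which gives a global-matching control.
    """
    n = len(tokens)
    candidates = []
    # Assumed split: even positions are merge sources, odd positions are targets.
    for src in range(0, n, 2):
        best = None
        for dst in range(1, n, 2):
            if window is not None and abs(src - dst) > window:
                continue  # temporal window constraint
            score = cosine(tokens[src], tokens[dst])
            if best is None or score > best[0]:
                best = (score, dst)
        if best is not None:
            candidates.append((best[0], src, best[1]))

    # Keep the r most similar (source, target) pairs; the sort is stable,
    # so ties are resolved in temporal order of the source token.
    candidates.sort(key=lambda c: -c[0])

    sums = [[float(x) for x in t] for t in tokens]
    counts = [1] * n
    removed = set()
    for _, src, dst in candidates[:max(r, 0)]:
        sums[dst] = [a + b for a, b in zip(sums[dst], tokens[src])]
        counts[dst] += 1
        removed.add(src)

    # Average each merged group and return survivors in temporal order.
    return [[x / counts[i] for x in sums[i]] for i in range(n) if i not in removed]
```

Calling the function with `window=1` restricts each source token to its immediate neighbors, while `window=None` lets it match any target in the clip. Comparing the two at the same `r` mirrors, in spirit, the local-versus-global contrast the abstract describes.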
Problem

Research questions and friction points this paper is trying to address.

audio-language models
token compression
temporal locality
inference efficiency
context budget
Innovation

Methods, ideas, or system contributions that make the work stand out.

Local Temporal Bipartite Merging
training-free compression
temporal locality
audio-language models
token reduction