HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models

📅 2026-04-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high inference cost of large audio language models by proposing HeadRouter, a training-free dynamic head-weight routing mechanism that exploits the previously unobserved sparsity and heterogeneity of attention head responses across diverse audio tasks. Recognizing that existing token compression methods overlook the varying importance of different attention heads, HeadRouter introduces a task-adaptive token pruning strategy guided by importance-aware dynamic routing. This approach enables efficient audio sequence compression while preserving or even enhancing model performance. Evaluated on the AudioMarathon and MMAU-Pro benchmarks, HeadRouter achieves 101.8% and 103.0% of the original performance on Qwen2.5-Omni-3B and 7B models, respectively, using only 70% of the original audio tokens.

📝 Abstract

Recent large audio language models (LALMs) demonstrate remarkable capabilities in processing extended multi-modal sequences, yet incur high inference costs. Token compression addresses this by directly removing redundant tokens from the sequence. Existing compression methods usually assume that all attention heads in LALMs contribute equally to every audio task and calculate token importance by averaging scores across all heads. However, our analysis demonstrates that attention heads exhibit distinct behaviors across diverse audio domains. We further reveal that only a sparse subset of attention heads actively responds to audio, and that these heads behave very differently on semantic and acoustic tasks. In light of this observation, we propose HeadRouter, a head-importance-aware token pruning method that accounts for the varying importance of attention heads across audio tasks to maximize the retention of crucial tokens. HeadRouter is training-free and can be applied to various LALMs. Extensive experiments on the AudioMarathon and MMAU-Pro benchmarks demonstrate that HeadRouter achieves state-of-the-art compression performance, exceeding the baseline model even when retaining only 70% of the audio tokens and achieving 101.8% and 103.0% of the vanilla average on Qwen2.5-Omni-3B and Qwen2.5-Omni-7B, respectively.
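
The contrast the abstract draws between head-averaged and head-weighted token importance can be illustrated with a small sketch. This is not the paper's algorithm: the abstract does not specify how HeadRouter derives its head weights, so the `head_weights` values, the function names `token_scores` and `keep_top_tokens`, and the toy attention matrix below are all hypothetical, chosen only to show how reweighting heads can change which tokens survive pruning.

```python
from typing import List, Optional, Sequence


def token_scores(
    attn: Sequence[Sequence[float]],
    head_weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Score each audio token as a weighted sum of per-head attention.

    attn[h][t] is the attention mass head h assigns to audio token t.
    With head_weights=None every head counts equally, which is the
    uniform averaging the abstract attributes to existing methods.
    """
    num_heads = len(attn)
    num_tokens = len(attn[0])
    if head_weights is None:
        head_weights = [1.0] * num_heads
    total = sum(head_weights)
    norm = [w / total for w in head_weights]  # weights sum to 1
    return [
        sum(norm[h] * attn[h][t] for h in range(num_heads))
        for t in range(num_tokens)
    ]


def keep_top_tokens(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return indices of the highest-scoring tokens, in original order."""
    k = max(1, int(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:k])


# Toy example: 3 heads, 4 audio tokens (values are made up).
attn = [
    [0.10, 0.60, 0.20, 0.10],  # head 0 focuses on token 1
    [0.25, 0.25, 0.25, 0.25],  # head 1 is flat and carries no preference
    [0.50, 0.10, 0.10, 0.30],  # head 2 focuses on tokens 0 and 3
]

uniform_keep = keep_top_tokens(token_scores(attn), keep_ratio=0.5)

# Hypothetical task-specific weights that emphasize head 2.
weighted_keep = keep_top_tokens(
    token_scores(attn, head_weights=[0.1, 0.1, 0.8]), keep_ratio=0.5
)

print(uniform_keep, weighted_keep)
```

In this toy case, uniform averaging keeps tokens 0 and 1, while emphasizing head 2 keeps tokens 0 and 3. The retained set depends on which heads are trusted for the task, which is the effect the abstract describes.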

Problem

Research questions and friction points this paper is trying to address.

token pruning

attention heads

audio language models

task adaptation

token compression

Innovation

Methods, ideas, or system contributions that make the work stand out.

HeadRouter

attention head sparsity

task-adaptive pruning