AHAMask: Reliable Task Specification for Large Audio Language Models without Instructions

📅 2025-09-01
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Current large audio language models (LALMs) suffer from instruction sensitivity: semantically identical intents can yield unstable outputs under minor variations in natural language instruction phrasing. To address this, the paper proposes AHAMask, which selectively masks attention heads in the decoder-only LLM backbone so that specific acoustic task functionalities are triggered without any instruction. The masks are obtained by lightweight training on a pretrained LALM, with the number of trainable parameters equal to the attention head count of the backbone. Experiments show that these selective head masks match or surpass instruction-driven baselines on both single and composite audio tasks, while significantly improving the stability, robustness, and controllability of task specification. The effectiveness of the masks also suggests that LALMs exhibit certain "functional pathways" in their attention heads.

📝 Abstract
Although current large audio language models (LALMs) extend text large language models (LLMs) with generic acoustic understanding abilities, they usually suffer from instruction sensitivity, where different instructions with the same intent can yield drastically different outcomes. In this work, we propose AHAMask, where we simply mask some of the attention heads in the decoder-only LLM backbone of LALMs, to trigger specific acoustic task functionalities without instructions. These masks are efficiently obtained by training on an LALM, with the number of trainable parameters equal to the attention head count in its LLM backbone. We show by experiments that applying such selective attention head masks achieves comparable or even better performance than using instructions, either on single or composite tasks. Besides achieving reliable acoustic task specification for LALMs, this also reveals that LALMs exhibit certain "functional pathways" in their attention heads.
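
To make the mechanism concrete, below is a minimal PyTorch sketch of per-head masking inside a decoder self-attention block. The module name, shapes, and the exact point where the mask is applied (zeroing each head's output before the output projection) are illustrative assumptions; the paper's integration into an actual LALM backbone may differ, and causal masking is omitted for brevity.

```python
import torch
import torch.nn as nn

class MaskedSelfAttention(nn.Module):
    """Self-attention with a per-head binary mask (illustrative sketch).

    head_mask zeroes the output of masked heads before the output
    projection; where AHAMask applies its masks inside the LALM decoder
    is an assumption here, not taken from the paper.
    """

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, head_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); head_mask: (n_heads,) with 0/1 entries
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split into heads: (batch, n_heads, seq, d_head)
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = attn @ v                              # (batch, n_heads, seq, d_head)
        heads = heads * head_mask.view(1, -1, 1, 1)   # silence the masked heads
        return self.proj(heads.transpose(1, 2).reshape(b, t, -1))

# Example: mask head 1 of a 4-head block
mha = MaskedSelfAttention(d_model=64, n_heads=4)
x = torch.randn(2, 10, 64)
y = mha(x, head_mask=torch.tensor([1.0, 0.0, 1.0, 1.0]))
```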
Problem

Research questions and friction points this paper is trying to address.

Reducing instruction sensitivity in audio language models
Enabling task specification without explicit instructions
Identifying functional pathways in attention heads
Innovation

Methods, ideas, or system contributions that make the work stand out.

Masking attention heads to trigger acoustic tasks
Training only as many parameters as there are attention heads in the backbone (see the sketch after this list)
Matching or surpassing instruction-driven performance while avoiding instruction sensitivity
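
Assuming a straightforward realization of this parameter budget, one could allocate a single trainable logit per head and binarize it with a straight-through estimator while the backbone stays frozen. The sketch below is such an assumed setup, not the paper's exact training recipe; a mask row produced this way could be fed as the head_mask of the attention sketch above.

```python
import torch
import torch.nn as nn

class HeadGates(nn.Module):
    """One trainable logit per attention head of the whole backbone,
    so the parameter count equals the backbone's head count.

    Binarizing with sigmoid gates and a straight-through estimator is
    an assumption; the paper may obtain its masks differently.
    """

    def __init__(self, n_layers: int, heads_per_layer: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_layers, heads_per_layer))

    def forward(self) -> torch.Tensor:
        soft = torch.sigmoid(self.logits)   # relaxed gates in (0, 1)
        hard = (soft > 0.5).float()         # binary masks used at inference
        # straight-through: binary values forward, sigmoid gradients backward
        return hard + soft - soft.detach()

# e.g. a 32-layer backbone with 32 heads per layer -> 1024 trainable scalars
gates = HeadGates(n_layers=32, heads_per_layer=32)
optimizer = torch.optim.Adam(gates.parameters(), lr=1e-2)
# Training would freeze the LALM backbone, pass gates()[layer_idx] as the
# head_mask of each decoder layer, and backpropagate the task loss into
# `gates.logits` only.
```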
👥 Authors
Yiwei Guo
X-LANCE Lab, MoE Key Lab of Artificial Intelligence, School of Computer Science, Shanghai Jiao Tong University, China
Bohan Li
X-LANCE Lab, MoE Key Lab of Artificial Intelligence, School of Computer Science, Shanghai Jiao Tong University, China
Hankun Wang
Shanghai Jiao Tong University
Zhihan Li
Kuaishou Technology, Tsinghua University
Shuai Wang
Nanjing University, China
Xie Chen
X-LANCE Lab, MoE Key Lab of Artificial Intelligence, School of Computer Science, Shanghai Jiao Tong University, China
Kai Yu
X-LANCE Lab, MoE Key Lab of Artificial Intelligence, School of Computer Science, Shanghai Jiao Tong University, China