Focus Then Listen: Exploring Plug-and-Play Audio Enhancer for Noise-Robust Large Audio Language Models

📅 2026-03-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the significant performance degradation of large audio language models (LALMs) in real-world noisy environments. Noise-aware fine-tuning can improve robustness, but it relies on task-specific noisy data and costly retraining, which limits scalability. To overcome this, the authors propose Focus-Then-Listen (FTL), a plug-and-play audio enhancer that first separates the input into speech and non-speech components, then uses a modality router to select the target modality from the user's instruction, and finally generates a task-adaptive enhanced signal through modality-aware fusion. Notably, FTL improves robustness without fine-tuning the downstream models. Experiments across multiple LALMs and tasks show performance gains at different noise levels, highlighting FTL's modularity and scalability.

📝 Abstract

Large audio language models (LALMs) are a class of foundation models for audio understanding. Existing LALMs tend to degrade significantly in real-world noisy acoustic conditions where speech and non-speech sounds interfere. While noise-aware fine-tuning can improve robustness, it requires task-specific noisy data and expensive retraining, limiting scalability. To address this issue, we propose Focus-Then-Listen (FTL), a plug-and-play audio enhancer that improves LALMs' noise robustness. Specifically, FTL first separates the input waveform into speech and non-speech, and a modality router is applied to predict the target audio modality (e.g., speech) based on the user's instruction. Finally, a modality-aware fusion block generates a task-adaptive enhanced signal for improved downstream perception and reasoning. Experiments across multiple LALMs and tasks show that FTL improves performance across different noise levels without fine-tuning the LALMs.
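
The three stages described in the abstract (separate, route, fuse) can be illustrated with a minimal sketch. This is not the authors' implementation: the keyword-based `route_modality`, the gain-based `fuse`, the `residual_gain` parameter, and the injected `separate` callable are all hypothetical stand-ins for what the paper describes as a separation front end, a modality router, and a modality-aware fusion block. The sketch only shows how such an enhancer could sit in front of an unmodified LALM.

```python
from typing import Callable, List, Tuple

Waveform = List[float]
# A separator maps a mixture to (speech, non_speech) stems. Hypothetical interface.
Separator = Callable[[Waveform], Tuple[Waveform, Waveform]]

# Toy keyword lists (assumption): the paper's router is a predictive component,
# not a keyword matcher.
SPEECH_HINTS = ("transcribe", "speaker", "spoken", "speech", "say")
SOUND_HINTS = ("sound", "noise", "music", "background", "event")


def route_modality(instruction: str) -> str:
    """Guess the target modality from the user's instruction."""
    text = instruction.lower()
    speech_score = sum(hint in text for hint in SPEECH_HINTS)
    sound_score = sum(hint in text for hint in SOUND_HINTS)
    if speech_score > sound_score:
        return "speech"
    if sound_score > speech_score:
        return "non_speech"
    return "mixed"  # no clear preference: keep both stems


def fuse(speech: Waveform, non_speech: Waveform, target: str,
         residual_gain: float = 0.1) -> Waveform:
    """Weighted sum that keeps the target stem and attenuates the other one.

    A simple stand-in (assumption) for the modality-aware fusion block.
    """
    if target == "speech":
        speech_gain, sound_gain = 1.0, residual_gain
    elif target == "non_speech":
        speech_gain, sound_gain = residual_gain, 1.0
    else:
        speech_gain, sound_gain = 1.0, 1.0
    return [speech_gain * s + sound_gain * n for s, n in zip(speech, non_speech)]


def focus_then_listen(mixture: Waveform, instruction: str, separate: Separator,
                      residual_gain: float = 0.1) -> Waveform:
    """Separate, route, then fuse; the result would be passed to a frozen LALM."""
    speech, non_speech = separate(mixture)
    target = route_modality(instruction)
    return fuse(speech, non_speech, target, residual_gain)


# Demo with an oracle separator that returns known stems (for illustration only).
demo_speech = [0.5, -0.5, 0.25, 0.0]
demo_noise = [0.1, 0.1, -0.2, 0.3]
demo_mixture = [s + n for s, n in zip(demo_speech, demo_noise)]


def oracle_separate(_mixture: Waveform) -> Tuple[Waveform, Waveform]:
    return demo_speech, demo_noise


enhanced = focus_then_listen(
    demo_mixture, "Transcribe what the speaker says.", oracle_separate
)
```

Because the enhancer only transforms the waveform before it reaches the model, the downstream LALM needs no retraining in this arrangement, which is the plug-and-play property the abstract emphasizes.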

Problem

Research questions and friction points this paper is trying to address.

noise robustness

large audio language models

audio enhancement

noisy acoustic conditions

Innovation

Methods, ideas, or system contributions that make the work stand out.

plug-and-play audio enhancer

noise-robust LALMs

modality routing

speech-non-speech separation

modality-aware fusion

🔎 Similar Papers

MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders

2024-09-10 · arXiv.org · Citations: 1
