Focus Then Listen: Exploring Plug-and-Play Audio Enhancer for Noise-Robust Large Audio Language Models

📅 2026-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the significant performance degradation of large audio language models (LALMs) in real-world noisy environments. Noise-aware fine-tuning can improve robustness, but it relies on task-specific noisy data and costly retraining, which limits scalability. To overcome this, the authors propose Focus-Then-Listen (FTL), a plug-and-play audio enhancer that works in three steps: it separates the input waveform into speech and non-speech components, uses a modality router to predict the target modality from the user's instruction, and generates a task-adaptive enhanced signal through modality-aware fusion. Because FTL operates on the input audio, it requires no fine-tuning of the downstream model. Experiments across multiple LALMs and tasks show improved performance at different noise levels.

📝 Abstract
Large audio language models (LALMs) are a class of foundation models for audio understanding. Existing LALMs tend to degrade significantly in real-world noisy acoustic conditions where speech and non-speech sounds interfere. While noise-aware fine-tuning can improve robustness, it requires task-specific noisy data and expensive retraining, limiting scalability. To address this issue, we propose Focus-Then-Listen (FTL), a plug-and-play audio enhancer that improves LALMs' noise robustness. Specifically, FTL first separates the input waveform into speech and non-speech components; a modality router then predicts the target audio modality (e.g., speech) based on the user's instruction. Finally, a modality-aware fusion block generates a task-adaptive enhanced signal for improved downstream perception and reasoning. Experiments across multiple LALMs and tasks show that FTL improves performance across different noise levels without fine-tuning the LALMs.
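
The separate, route, fuse flow described in the abstract can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the authors' implementation: the function names (`toy_separator`, `route_modality`, `fuse`, `focus_then_listen`), the keyword list, and the `residual_gain` parameter are all hypothetical. In particular, the frequency split stands in for a learned speech/non-speech separator, the keyword match stands in for the modality router, and the weighted sum stands in for the modality-aware fusion block.

```python
from typing import Callable, Tuple

import numpy as np

# A separator maps a waveform to (speech, non_speech) estimates.
Separator = Callable[[np.ndarray], Tuple[np.ndarray, np.ndarray]]

# Hypothetical keyword list; the paper's router is a model, not a lookup.
SPEECH_KEYWORDS = ("say", "said", "speak", "transcribe", "word", "speech")


def toy_separator(waveform: np.ndarray, sample_rate: int = 16000,
                  cutoff_hz: float = 1000.0) -> Tuple[np.ndarray, np.ndarray]:
    """Stand-in separator: treats content below cutoff_hz as 'speech'.

    A frequency split is NOT real source separation; it only keeps the
    sketch runnable without a trained model.
    """
    spectrum = np.fft.rfft(waveform)
    freqs = np.fft.rfftfreq(len(waveform), d=1.0 / sample_rate)
    low_band = np.where(freqs < cutoff_hz, spectrum, 0)
    speech = np.fft.irfft(low_band, n=len(waveform))
    return speech, waveform - speech


def route_modality(instruction: str) -> str:
    """Toy router: returns 'speech' or 'non_speech' from the instruction text."""
    text = instruction.lower()
    if any(keyword in text for keyword in SPEECH_KEYWORDS):
        return "speech"
    return "non_speech"


def fuse(speech: np.ndarray, non_speech: np.ndarray, target: str,
         residual_gain: float = 0.1) -> np.ndarray:
    """Toy fusion: keep the target stream, attenuate the other one."""
    if target == "speech":
        return speech + residual_gain * non_speech
    return non_speech + residual_gain * speech


def focus_then_listen(waveform: np.ndarray, instruction: str,
                      separator: Separator,
                      residual_gain: float = 0.1) -> np.ndarray:
    """Separate, route on the instruction, then fuse into one enhanced signal."""
    speech, non_speech = separator(waveform)
    target = route_modality(instruction)
    return fuse(speech, non_speech, target, residual_gain)


if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    # Synthetic mixture: a 200 Hz tone as 'speech', a 3 kHz tone as 'noise'.
    mixture = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 3000 * t)
    enhanced = focus_then_listen(mixture, "Transcribe what the speaker says.",
                                 toy_separator)
    print(enhanced.shape)
```

The enhanced waveform would then be passed to an unmodified LALM in place of the raw input, which is what makes the approach plug-and-play. Attenuating the non-target stream instead of discarding it is a design choice of this sketch only; the source does not specify how the fusion block weights the two streams.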
Problem

Research questions and friction points this paper is trying to address.

noise robustness
large audio language models
audio enhancement
noisy acoustic conditions
Innovation

Methods, ideas, or system contributions that make the work stand out.

plug-and-play audio enhancer
noise-robust LALMs
modality routing
speech-non-speech separation
modality-aware fusion