MUGEN: Evaluating and Improving Multi-audio Understanding of Large Audio-Language Models

📅 2026-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large audio-language models perform poorly on multi-audio understanding tasks, with accuracy degrading sharply as the number of concurrent audio sources increases. To address this gap, this work introduces the first comprehensive benchmark for multi-audio understanding, spanning speech, general audio, and music, and enables a systematic evaluation of model limitations in such complex auditory scenarios. The study further proposes two training-free inference strategies: Audio-Permutational Self-Consistency, which aggregates predictions across permuted orderings of the audio inputs and yields absolute accuracy gains of up to 6.28%, and its combination with Chain-of-Thought prompting, which raises the gains to 6.74%. Together, the benchmark and these results expose critical shortcomings of existing models in complex auditory comprehension.

📝 Abstract
While multi-audio understanding is critical for large audio-language models (LALMs), it remains underexplored. We introduce MUGEN, a comprehensive benchmark evaluating this capability across speech, general audio, and music. Our experiments reveal consistent weaknesses in multi-audio settings, and performance degrades sharply as the number of concurrent audio inputs increases, identifying input scaling as a fundamental bottleneck. We further investigate training-free strategies and observe that Audio-Permutational Self-Consistency, which diversifies the order of audio candidates, helps models form more robust aggregated predictions, yielding accuracy gains of up to 6.28%. Combining this permutation strategy with Chain-of-Thought further raises the gains to 6.74%. These results expose blind spots in current LALMs and provide a foundation for evaluating complex auditory comprehension.
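The abstract describes Audio-Permutational Self-Consistency only at a high level: query the model once per ordering of the audio inputs, then aggregate the answers. A minimal sketch of that idea might look like the following; `query_lalm`, `apsc_predict`, and the majority-vote aggregation are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of Audio-Permutational Self-Consistency (APSC):
# run the model once per permutation of the audio inputs, then take a
# majority vote over the per-permutation answers.
from collections import Counter
from itertools import permutations


def query_lalm(question, audios):
    # Hypothetical placeholder: a real implementation would call an
    # audio-language model with the audios attached in the given order.
    raise NotImplementedError


def apsc_predict(question, audios, query_fn=query_lalm):
    """Aggregate one prediction per permutation of the audio inputs."""
    answers = [
        query_fn(question, list(order))
        for order in permutations(audios)
    ]
    # Majority vote: the most frequent answer across orderings wins.
    return Counter(answers).most_common(1)[0][0]
```

For more than a handful of audio inputs, enumerating all permutations is infeasible (n! calls), so a practical variant would sample a fixed number of random orderings instead.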
Problem

Research questions and friction points this paper is trying to address.

multi-audio understanding
large audio-language models
audio benchmark
input scaling
auditory comprehension
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-audio understanding
large audio-language models
Audio-Permutational Self-Consistency
input scaling bottleneck
training-free strategy
Chih-Kai Yang
National Taiwan University
Deep Learning, Speech Processing, Natural Language Processing, Machine Learning
Yun-Shao Tsai
National Taiwan University, Taiwan
Yu-Kai Guo
National Taiwan University, Taiwan
Ping-Le Tsai
National Taiwan University, Taiwan
Yen-Ting Piao
National Taiwan University, Taiwan
Hung-Wei Chen
National Taiwan University, Taiwan
Ting-Lin Hsiao
National Taiwan University, Taiwan
Yun-Man Hsu
National Taiwan University, Taiwan
Ke-Han Lu
National Taiwan University
Natural Language Processing, Speech Recognition
Hung-yi Lee
National Taiwan University
deep learning, spoken language understanding, speech processing