Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in The DCASE 2025 Challenge

📅 2025-05-12
🤖 AI Summary
Audio-language models exhibit limited reasoning capability, and the field lacks a unified evaluation benchmark for multi-domain audio question answering (AQA). Method: We introduce the first structured, acoustics-oriented multi-domain AQA benchmark, comprising three subtasks: bioacoustic identification, temporal soundscape understanding, and complex logical reasoning. Our framework enables cross-acoustic-domain evaluation and incorporates an answer-shuffling robustness assessment. We conduct systematic evaluations using Qwen2-Audio-7B, AudioFlamingo 2, and Gemini-2-Flash. Contribution/Results: The benchmark spans diverse acoustic semantics, from marine mammal vocalizations to real-world urban soundscapes, leveraging multi-source heterogeneous data and a top-1 accuracy protocol. Development-set results reveal significant inter-domain performance disparities, establishing a new standard for fine-grained, reproducible evaluation of acoustic reasoning in audio-language models.

📝 Abstract
We present Task 5 of the DCASE 2025 Challenge: an Audio Question Answering (AQA) benchmark spanning multiple domains of sound understanding. This task defines three QA subsets (Bioacoustics, Temporal Soundscapes, and Complex QA) to test audio-language models on interactive question answering over diverse acoustic scenes. We describe the dataset composition (from marine mammal calls to soundscapes and complex real-world clips), the evaluation protocol (top-1 accuracy with answer-shuffling robustness), and baseline systems (Qwen2-Audio-7B, AudioFlamingo 2, Gemini-2-Flash). Preliminary results on the development set are compared, showing strong variation across models and subsets. This challenge aims to advance the audio understanding and reasoning capabilities of audio-language models toward human-level acuity, capabilities that are crucial for enabling AI agents to perceive and interact with the world effectively.
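The evaluation protocol described above (top-1 accuracy with answer-shuffling robustness) could be sketched roughly as follows. This is an illustrative assumption, not the official challenge harness: `model_fn`, the question dictionary fields, and `num_shuffles` are hypothetical names chosen for the example.

```python
import random

def evaluate_with_shuffling(model_fn, questions, num_shuffles=4, seed=0):
    """Top-1 accuracy averaged over random re-orderings of the answer
    choices, so a model cannot score by exploiting option position."""
    rng = random.Random(seed)
    correct = 0
    total = 0
    for q in questions:
        for _ in range(num_shuffles):
            choices = q["choices"][:]
            rng.shuffle(choices)  # present options in a fresh random order
            # model_fn returns the text of its single top-1 choice
            prediction = model_fn(q["audio"], q["question"], choices)
            correct += int(prediction == q["answer"])
            total += 1
    return correct / total

# A position-biased dummy model that always picks the first option;
# shuffling exposes this bias as chance-level accuracy.
def first_option_model(audio, question, choices):
    return choices[0]
```

The point of the shuffling is that a model answering from acoustic content scores identically across orderings, while a model exploiting option position degrades toward chance.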
Problem

Research questions and friction points this paper is trying to address.

Advance audio understanding in multi-domain QA tasks
Test audio-language models on diverse acoustic scenes
Improve AI's perception of real-world sound interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-domain audio QA benchmark
Three QA subsets testing audio-language models
Baseline systems include Qwen2-Audio-7B
Chao-Han Huck Yang
Sr. Research Scientist, NVIDIA Research
Robust Speech Recognition, Language Models, Post-Training, Sequence Modeling

Sreyan Ghosh
Ph.D. in CS at University of Maryland, College Park
AI, Machine Learning, NLP, Speech Recognition

Qing Wang
University of Science and Technology of China

Jaeyeon Kim
Seoul National University

Hengyi Hong
University of Science and Technology of China

Sonal Kumar
University of Maryland, College Park

Guirui Zhong
University of Science and Technology of China

Zhifeng Kong
Senior Research Scientist, NVIDIA
Deep Generative Models, Diffusion Models, Audio Foundation Models, Audio LM, Trustworthy ML

S Sakshi
Ph.D. in CS at University of Maryland, College Park
Machine Learning, Natural Language Processing, Audio Processing

Vaibhavi Lokegaonkar
University of Maryland, College Park

Oriol Nieto
Senior Research Engineer II at Adobe
Audio Processing, Generative AI, Music Information Retrieval, Recommender Systems

Ramani Duraiswami
Computer Science and UMIACS, University of Maryland
Scientific Computing, Spatial Audio, Machine Learning, Computational Electromagnetics

Dinesh Manocha
Distinguished University Professor, University of Maryland at College Park
Computer Graphics, Geometric Modeling, Motion Planning, Virtual Reality, Robotics

Gunhee Kim
Professor, Seoul National University
Computer Vision, Machine Learning, Natural Language Processing

Jun Du
University of Science and Technology of China

Rafael Valle
NVIDIA, UC Berkeley, CNMAT
Machine Listening and Improvisation

Bryan Catanzaro
NVIDIA
Parallel Computing, Machine Learning