AURA Score: A Metric For Holistic Audio Question Answering Evaluation

📅 2025-10-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio question answering (AQA) evaluation metrics, such as BLEU, METEOR, and BERTScore, rely on surface-level text matching and fail to account for question context, logical reasoning, or partial correctness, limiting their reliability for open-ended responses. Method: We introduce AQEval, the first benchmark for systematically evaluating AQA metrics, comprising 10,000 model responses annotated by multiple humans for correctness and relevance, and propose the AURA score, a metric that integrates question context, semantic understanding, and reasoning to better credit plausible and partially correct answers. Contribution/Results: Experiments demonstrate that the AURA score achieves significantly higher correlation with human judgments than all baseline metrics, especially for longer answers. Both AQEval and the AURA score are publicly released to support future research in holistic evaluation of audio-language models.

📝 Abstract
Audio Question Answering (AQA) is a key task for evaluating Audio-Language Models (ALMs), yet assessing open-ended responses remains challenging. Existing metrics used for AQA, such as BLEU, METEOR, and BERTScore, are mostly adapted from NLP and audio captioning; they rely on surface similarity and fail to account for question context, reasoning, and partial correctness. To address this gap in the literature, we make three contributions in this work. First, we introduce AQEval to enable systematic benchmarking of AQA metrics. It is the first benchmark of its kind, consisting of 10k model responses annotated by multiple humans for their correctness and relevance. Second, we conduct a comprehensive analysis of existing AQA metrics on AQEval, highlighting their weak correlation with human judgment, especially for longer answers. Third, we propose a new metric, the AURA score, to better evaluate open-ended model responses. On AQEval, AURA achieves state-of-the-art correlation with human ratings, significantly outperforming all baselines. Through this work, we aim to highlight the limitations of current AQA evaluation methods and to motivate the development of better metrics. We release both the AQEval benchmark and the AURA metric to support future research in holistic AQA evaluation.
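
To make the kind of analysis described in the abstract concrete, the sketch below shows one way metric-to-human correlation could be computed on a benchmark like AQEval. Everything here is an illustrative assumption rather than a detail from the paper: the data layout, the toy token-overlap metric, and the choice of Spearman rank correlation.

```python
# A minimal sketch of metric-vs-human correlation analysis in the spirit
# of the AQEval study described above. The data layout, the placeholder
# metric, and the use of Spearman correlation are assumptions, not
# details taken from the paper.
from scipy.stats import spearmanr

def metric_human_correlation(samples, human_ratings, metric_fn):
    """Score each (question, reference, response) sample with metric_fn,
    then rank-correlate the metric scores with human ratings."""
    metric_scores = [
        metric_fn(s["question"], s["reference"], s["response"])
        for s in samples
    ]
    return spearmanr(metric_scores, human_ratings)

def overlap_metric(question, reference, response):
    """Trivial stand-in metric: Jaccard overlap of answer tokens.
    Note that it ignores the question entirely, which is exactly the
    kind of context blindness the paper criticizes."""
    ref = set(reference.lower().split())
    hyp = set(response.lower().split())
    return len(ref & hyp) / max(len(ref | hyp), 1)

# Hypothetical samples and ratings, purely for illustration.
samples = [
    {"question": "What animal is heard?", "reference": "a dog barking",
     "response": "a dog is barking loudly"},
    {"question": "Is music playing?", "reference": "yes, piano music",
     "response": "there is no sound at all"},
    {"question": "What vehicle passes by?", "reference": "a motorcycle",
     "response": "a motorbike drives past"},
]
human_ratings = [5.0, 1.0, 4.0]  # assumed mean annotator scores

print(metric_human_correlation(samples, human_ratings, overlap_metric))
```

A context-blind metric like the toy one above can track human judgment on short, near-verbatim answers yet drift badly on longer, partially correct ones, which is the failure mode the paper quantifies on AQEval.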
Problem

Research questions and friction points this paper is trying to address.

Existing AQA metrics fail to evaluate reasoning and partial correctness
Current evaluation methods show weak correlation with human judgment
No holistic benchmark exists for systematic AQA metric evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces AQEval benchmark for systematic AQA evaluation
Proposes the AURA score, a metric for open-ended audio responses (see the illustrative sketch after this list)
Achieves state-of-the-art correlation with human ratings
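
Purely to illustrate what question-conditioned scoring could look like, here is a minimal sketch using off-the-shelf sentence embeddings. This is not the paper's actual AURA formulation, which is not specified on this page; the model name and the embedding strategy are assumptions.

```python
# Hypothetical illustration of question-conditioned answer scoring.
# NOT the paper's AURA formulation; the model choice and the idea of
# prepending the question to each answer are assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def context_aware_score(question, reference, response):
    """Embed each answer together with the question, so that semantic
    similarity is judged in the context of what was asked rather than
    by comparing the answer strings in isolation."""
    ref_emb, resp_emb = model.encode(
        [f"{question} {reference}", f"{question} {response}"]
    )
    return float(util.cos_sim(ref_emb, resp_emb))

print(context_aware_score(
    "What instrument is playing?",
    "a piano",
    "someone is playing the piano",
))
```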