ORCA: Open-ended Response Correctness Assessment for Audio Question Answering

📅 2025-11-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Evaluating open-ended audio question answering (QA) for large audio-language models (LALMs) is hard: annotators often disagree, partial correctness is ambiguous, and scalar scores cannot capture uncertainty. Method: We propose the first automated evaluation framework that explicitly models judgment uncertainty. Specifically, (1) we introduce Beta-distributed modeling of answer correctness, jointly estimating expected correctness and its uncertainty, and (2) we design a three-stage human-in-the-loop annotation paradigm that integrates structured human feedback with iterative refinement. Results: On 3,580 audio QA pairs annotated with high inter-annotator agreement (Krippendorff's alpha = 0.82), the framework reaches a Spearman correlation of 0.91 with mean human judgments, matching or outperforming LLM-based judges while substantially reducing computational overhead. This work establishes an interpretable, robust, and low-resource paradigm for evaluating open-ended generative audio understanding.
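To make the Beta-correctness idea concrete, here is a minimal Python sketch. It is not the paper's implementation: ORCA presumably learns (alpha, beta) with a trained judge, whereas here a method-of-moments fit to a few hypothetical annotator ratings stands in for that estimator.

```python
# Minimal sketch, NOT the paper's model: a method-of-moments Beta fit to a
# few hypothetical human ratings stands in for ORCA's learned estimator.
import math

def beta_mean_and_variance(alpha: float, beta: float) -> tuple[float, float]:
    """Expected correctness and its variance under Beta(alpha, beta)."""
    mean = alpha / (alpha + beta)
    var = (alpha * beta) / ((alpha + beta) ** 2 * (alpha + beta + 1))
    return mean, var

def fit_beta_moments(ratings: list[float]) -> tuple[float, float]:
    """Method-of-moments Beta fit to human correctness ratings in [0, 1]."""
    n = len(ratings)
    m = sum(ratings) / n
    v = sum((r - m) ** 2 for r in ratings) / max(n - 1, 1)
    v = max(v, 1e-6)  # guard against zero variance (unanimous annotators)
    common = m * (1.0 - m) / v - 1.0
    return max(m * common, 1e-3), max((1.0 - m) * common, 1e-3)

# Hypothetical ratings from three annotators for one LALM answer:
# full credit, partial credit, partial credit.
a, b = fit_beta_moments([1.0, 0.5, 0.5])
mu, var = beta_mean_and_variance(a, b)
print(f"expected correctness {mu:.2f} +/- {math.sqrt(var):.2f}")
```

The payoff of the distributional view is the second number: two answers can share the same expected correctness while differing sharply in how much annotators would disagree about them.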

📝 Abstract
Evaluating open-ended responses from large audio language models (LALMs) is challenging because human annotators often genuinely disagree on answer correctness due to multiple valid interpretations, partial correctness, and subjective judgment. Traditional metrics reporting only mean scores fail to capture this uncertainty. We present ORCA (Open-ended Response Correctness Assessment), a framework that models the variability in human judgments using Beta distributions to predict both expected correctness and uncertainty. Our three-stage annotation framework combines human judgment with structured feedback and iterative refinement to simultaneously curate training data and improve benchmark quality. We collected 11,721 annotations across 3,580 question-answer pairs from 15 LALMs on two audio QA benchmarks, achieving inter-annotator agreement of 0.82 (Krippendorff's alpha). ORCA achieves 0.91 Spearman correlation with mean human judgments, matching or outperforming LLM-judge baselines while providing uncertainty estimates and requiring significantly less compute. We release our models, code, and curated dataset.
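For context on the two reported numbers, both metrics are standard and cheap to compute. A toy sketch follows, assuming the third-party `krippendorff` package and SciPy; the ratings and predictions below are made up for illustration, not the paper's data.

```python
# Toy sketch of the two reported metrics; all numbers here are invented.
# Requires: pip install numpy scipy krippendorff
import numpy as np
import krippendorff
from scipy.stats import spearmanr

# Rows = annotators, columns = QA pairs; np.nan marks missing annotations.
ratings = np.array([
    [1.0, 0.5, 0.0, 1.0, np.nan],
    [1.0, 0.5, 0.0, 0.5, 1.0],
    [np.nan, 1.0, 0.0, 1.0, 1.0],
])

# Inter-annotator agreement over the raw human judgments.
agreement = krippendorff.alpha(reliability_data=ratings,
                               level_of_measurement="interval")

# Rank correlation between a judge's predicted correctness scores
# and the mean human rating per QA pair.
predicted = np.array([0.9, 0.6, 0.1, 0.8, 0.95])
human_mean = np.nanmean(ratings, axis=0)
rho, _ = spearmanr(predicted, human_mean)

print(f"Krippendorff's alpha = {agreement:.2f}, Spearman rho = {rho:.2f}")
```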
Problem

Research questions and friction points this paper is trying to address.

Assessing correctness of open-ended audio QA responses with human disagreement.
Modeling judgment variability and uncertainty using Beta distributions.
Creating a framework to improve benchmark quality and training data.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Beta distributions model human judgment variability
Three-stage annotation combines human and structured feedback
Achieves high correlation with human judgments efficiently
Šimon Sedláček
Speech@FIT, Brno University of Technology, Czechia
Sara Barahona
Universidad Autónoma de Madrid, Spain
Bolaji Yusuf
Researcher, Brno University of Technology
Speech recognition · Spoken term detection
Laura Herrera-Alarcón
Universidad Autónoma de Madrid, Spain
S. Kesiraju
Speech@FIT, Brno University of Technology, Czechia
Cecilia Bolaños
University of Buenos Aires, Argentina
Alicia Lozano-Diez
Universidad Autónoma de Madrid (UAM)
Machine learning · deep neural networks (DNN) · language and speaker recognition
Sathvik Udupa
Speech@FIT, Brno University of Technology, Czechia
Fernando López
INO
Infrared Imaging · NDE · Signal Processing · Terahertz Imaging · Thermal Sciences
Allison Ferner
Tufts University, USA
R. Duraiswami
University of Maryland, USA
Jan Černocký
Speech@FIT, Brno University of Technology, Czechia