🤖 AI Summary
Existing reference-free audio captioning evaluation metrics (e.g., CLAPScore) operate without ground-truth annotations, yet their robustness has never been systematically validated. Method: We introduce BRACE, the first robustness benchmark for audio captioning, comprising two fine-grained subsets: BRACE-Main (caption similarity ranking) and BRACE-Hallucination (hallucination detection). Together they enable evaluation of modality alignment for both Audio Captioning Evaluation Metrics (ACEMs) and Large Audio-Language Models (LALMs). We propose a data curation paradigm that combines high-quality filtering, controlled LLM-based text corruption, and multi-round expert annotation to precisely model subtle hallucinations. Contribution/Results: Extensive cross-model evaluation across multiple CLAP variants and state-of-the-art LALMs reveals critical limitations: the best CLAP-based method achieves only a 70.01 F1-score on BRACE-Main, while the top LALM scores just 63.19, exposing severe audio-text alignment bottlenecks. BRACE provides a reproducible, scalable, and rigorous evaluation standard for audio-language alignment research.
📝 Abstract
Automatic audio captioning is essential for audio understanding, enabling applications such as accessibility and content indexing. However, evaluating the quality of audio captions remains a major challenge, especially in reference-free settings where high-quality ground-truth captions are unavailable. While CLAPScore is currently the most widely used reference-free Audio Caption Evaluation Metric (ACEM), its robustness under diverse conditions has not been systematically validated.
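In broad strokes, CLAPScore rates a candidate caption by the cosine similarity between a CLAP audio embedding and the caption's text embedding, requiring no reference caption. A minimal sketch of that scoring rule, using random stand-in vectors instead of a real pretrained CLAP model (the `clap_score` helper and toy embeddings here are illustrative, not the paper's implementation):

```python
import numpy as np

def clap_score(audio_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Reference-free caption score: cosine similarity between an audio
    embedding and a candidate caption's text embedding. In practice both
    vectors would come from a pretrained CLAP model's joint space."""
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_emb / np.linalg.norm(text_emb)
    return float(np.clip(a @ t, -1.0, 1.0))

# Toy demonstration with synthetic embeddings (not real CLAP outputs):
# a caption "aligned" with the audio should outscore an unrelated one.
rng = np.random.default_rng(0)
audio = rng.standard_normal(512)
good_caption = audio + 0.1 * rng.standard_normal(512)  # near the audio vector
bad_caption = rng.standard_normal(512)                 # unrelated direction
print(clap_score(audio, good_caption) > clap_score(audio, bad_caption))
```

Benchmarks like BRACE stress-test exactly this similarity signal: if small, hallucinated edits to a caption barely move its embedding, the score cannot separate faithful captions from corrupted ones.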
To address this gap, we introduce BRACE, a new benchmark designed to evaluate audio caption alignment quality in a reference-free setting. BRACE is primarily designed for assessing ACEMs, and can also be extended to measure the modality alignment abilities of Large Audio Language Models (LALMs). BRACE consists of two sub-benchmarks: BRACE-Main for fine-grained caption comparison and BRACE-Hallucination for detecting subtle hallucinated content. We construct these datasets through high-quality filtering, LLM-based corruption, and human annotation.
Given the widespread adoption of CLAPScore as a reference-free ACEM and the increasing application of LALMs in audio-language tasks, we evaluate both approaches using the BRACE benchmark, testing CLAPScore across various CLAP model variants and assessing multiple LALMs.
Notably, even the best-performing CLAP-based ACEM achieves only a 70.01 F1-score on the BRACE-Main benchmark, while the best LALM reaches just 63.19.
By revealing the limitations of CLAP models and LALMs, our BRACE benchmark offers valuable insights for future research directions.