🤖 AI Summary
Existing reference-free audio captioning evaluation metrics (e.g., CLAPScore) operate without ground-truth annotations, yet their robustness has never been systematically validated. Method: We introduce BRACE, the first robustness benchmark for audio captioning, comprising two fine-grained subsets: BRACE-Main (caption similarity ranking) and BRACE-Hallucination (hallucination detection). Together they enable evaluation of modality alignment for both Audio Captioning Evaluation Metrics (ACEMs) and Large Audio-Language Models (LALMs). We propose a data curation paradigm that combines high-quality filtering, controlled LLM-based text corruption, and multi-round expert annotation to precisely model subtle hallucinations. Contribution/Results: Extensive cross-model evaluation across multiple CLAP variants and state-of-the-art LALMs reveals critical limitations: the best CLAP-based method achieves only a 70.01 F1-score on BRACE-Main, while the top LALM scores just 63.19, exposing severe audio-text alignment bottlenecks. BRACE provides a reproducible, scalable, and rigorous evaluation standard for audio-language alignment research.
📝 Abstract
Automatic audio captioning is essential for audio understanding, enabling applications such as accessibility and content indexing. However, evaluating the quality of audio captions remains a major challenge, especially in reference-free settings where high-quality ground-truth captions are unavailable. While CLAPScore is currently the most widely used reference-free Audio Caption Evaluation Metric (ACEM), its robustness under diverse conditions has not been systematically validated.
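In broad strokes, CLAPScore rates a candidate caption by the cosine similarity between a CLAP audio embedding and the caption's text embedding, requiring no reference caption. A minimal sketch of that scoring rule, using random stand-in vectors instead of a real pretrained CLAP model (the `clap_score` helper and toy embeddings here are illustrative, not the paper's implementation):

```python
import numpy as np

def clap_score(audio_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Reference-free caption score: cosine similarity between an audio
    embedding and a candidate caption's text embedding. In practice both
    vectors would come from a pretrained CLAP model's joint space."""
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_emb / np.linalg.norm(text_emb)
    return float(np.clip(a @ t, -1.0, 1.0))

# Toy demonstration with synthetic embeddings (not real CLAP outputs):
# a caption "aligned" with the audio should outscore an unrelated one.
rng = np.random.default_rng(0)
audio = rng.standard_normal(512)
good_caption = audio + 0.1 * rng.standard_normal(512)  # near the audio vector
bad_caption = rng.standard_normal(512)                 # unrelated direction
print(clap_score(audio, good_caption) > clap_score(audio, bad_caption))
```

Benchmarks like BRACE stress-test exactly this similarity signal: if small, hallucinated edits to a caption barely move its embedding, the score cannot separate faithful captions from corrupted ones.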
To address this gap, we introduce BRACE, a new benchmark designed to evaluate audio caption alignment quality in a reference-free setting. BRACE is primarily designed for assessing ACEMs, and can also be extended to measure the modality alignment abilities of Large Audio Language Models (LALMs). BRACE consists of two sub-benchmarks: BRACE-Main for fine-grained caption comparison and BRACE-Hallucination for detecting subtle hallucinated content. We construct these datasets through high-quality filtering, LLM-based corruption, and human annotation.
Given the widespread adoption of CLAPScore as a reference-free ACEM and the increasing application of LALMs in audio-language tasks, we evaluate both approaches using the BRACE benchmark, testing CLAPScore across various CLAP model variants and assessing multiple LALMs.
Notably, even the best-performing CLAP-based ACEM achieves only a 70.01 F1-score on the BRACE-Main benchmark, while the best LALM reaches just 63.19.
By revealing the limitations of CLAP models and LALMs, our BRACE benchmark offers valuable insights for future research directions.