AQAScore: Evaluating Semantic Alignment in Text-to-Audio Generation via Audio Question Answering

📅 2026-01-21
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing evaluation metrics for text-to-audio generation predominantly rely on embedding similarity, which struggles to capture fine-grained semantic alignment and compositional reasoning, and exhibits limited correlation with human judgments. To address this, the paper proposes AQAScore, an architecture-agnostic evaluation framework that introduces an audio question-answering mechanism for the first time in this domain. AQAScore leverages audio-aware large language models (ALLMs) to perform probabilistic semantic verification by computing the log-probability of a "Yes" response to targeted semantic queries such as "Does the audio contain the content described in the text?" Experiments demonstrate that AQAScore significantly outperforms similarity-based metrics like CLAPScore and generative prompting baselines across multiple benchmarks, achieving high agreement with human ratings and effectively supporting compositional reasoning evaluation, with performance scaling alongside ALLM capability.

๐Ÿ“ Abstract
Although text-to-audio generation has made remarkable progress in realism and diversity, the development of evaluation metrics has not kept pace. Widely adopted approaches, typically based on embedding similarity like CLAPScore, effectively measure general relevance but remain limited in fine-grained semantic alignment and compositional reasoning. To address this, we introduce AQAScore, a backbone-agnostic evaluation framework that leverages the reasoning capabilities of audio-aware large language models (ALLMs). AQAScore reformulates assessment as a probabilistic semantic verification task; rather than relying on open-ended text generation, it estimates alignment by computing the exact log-probability of a "Yes" answer to targeted semantic queries. We evaluate AQAScore across multiple benchmarks, including human-rated relevance, pairwise comparison, and compositional reasoning tasks. Experimental results show that AQAScore consistently achieves higher correlation with human judgments than similarity-based metrics and generative prompting baselines, demonstrating its effectiveness in capturing subtle semantic inconsistencies and its ability to scale with the capability of the underlying ALLM.
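The core scoring rule described in the abstract can be sketched as follows. Note this is a minimal illustration, not the paper's implementation: `toy_allm_yes_logprob` is a hypothetical stand-in for querying a real audio-aware LLM for the log-probability of its "Yes" token, and the word-overlap heuristic inside it is purely for demonstration.

```python
import math

def aqa_score(yes_logprob: float) -> float:
    # AQAScore-style rule: map log P("Yes" | audio, question) to a [0, 1] score.
    return math.exp(yes_logprob)

def toy_allm_yes_logprob(audio_tags: set, caption: str) -> float:
    # Hypothetical stand-in for an ALLM query. A real system would prompt the
    # model with the audio plus a question like "Does the audio contain the
    # content described in the text?" and read off the "Yes" token's log-prob.
    # Here we fake P("Yes") as the fraction of caption words audible in the clip.
    words = caption.lower().split()
    p_yes = max(sum(w in audio_tags for w in words) / len(words), 1e-9)
    return math.log(p_yes)

lp = toy_allm_yes_logprob({"dog", "barking"}, "a dog barking")
score = aqa_score(lp)  # higher = better text-audio semantic alignment
```

Working in log-probability space (rather than parsing a free-form generated answer) is what makes the score a calibrated, deterministic quantity that can be compared across audio clips.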
Problem

Research questions and friction points this paper is trying to address.

text-to-audio generation
evaluation metrics
semantic alignment
compositional reasoning
audio question answering
Innovation

Methods, ideas, or system contributions that make the work stand out.

AQAScore
audio question answering
semantic alignment
audio-aware LLMs
text-to-audio evaluation
🔎 Similar Papers
No similar papers found.