AQAScore: Evaluating Semantic Alignment in Text-to-Audio Generation via Audio Question Answering

📅 2026-01-21
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing evaluation metrics for text-to-audio generation predominantly rely on embedding similarity, which struggles to capture fine-grained semantic alignment and compositional reasoning, and exhibits limited correlation with human judgments. To address this, the paper proposes AQAScore, an architecture-agnostic evaluation framework that introduces an audio question-answering mechanism for the first time in this domain. AQAScore leverages audio-aware large language models (ALLMs) to perform probabilistic semantic verification by computing the log-probability of a "Yes" response to targeted semantic queries such as "Does the audio contain the content described in the text?" Experiments demonstrate that AQAScore significantly outperforms similarity-based metrics like CLAPScore and generative prompting baselines across multiple benchmarks, achieving high agreement with human ratings and effectively supporting compositional reasoning evaluation, with performance scaling alongside ALLM capability.

๐Ÿ“ Abstract
Although text-to-audio generation has made remarkable progress in realism and diversity, the development of evaluation metrics has not kept pace. Widely adopted approaches, typically based on embedding similarity like CLAPScore, effectively measure general relevance but remain limited in fine-grained semantic alignment and compositional reasoning. To address this, we introduce AQAScore, a backbone-agnostic evaluation framework that leverages the reasoning capabilities of audio-aware large language models (ALLMs). AQAScore reformulates assessment as a probabilistic semantic verification task; rather than relying on open-ended text generation, it estimates alignment by computing the exact log-probability of a "Yes" answer to targeted semantic queries. We evaluate AQAScore across multiple benchmarks, including human-rated relevance, pairwise comparison, and compositional reasoning tasks. Experimental results show that AQAScore consistently achieves higher correlation with human judgments than similarity-based metrics and generative prompting baselines, demonstrating its effectiveness in capturing subtle semantic inconsistencies and its ability to scale with the capability of the underlying ALLM.
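The core scoring rule described in the abstract can be sketched as follows. Note this is a minimal illustration, not the paper's implementation: `toy_allm_yes_logprob` is a hypothetical stand-in for querying a real audio-aware LLM for the log-probability of its "Yes" token, and the word-overlap heuristic inside it is purely for demonstration.

```python
import math

def aqa_score(yes_logprob: float) -> float:
    # AQAScore-style rule: map log P("Yes" | audio, question) to a [0, 1] score.
    return math.exp(yes_logprob)

def toy_allm_yes_logprob(audio_tags: set, caption: str) -> float:
    # Hypothetical stand-in for an ALLM query. A real system would prompt the
    # model with the audio plus a question like "Does the audio contain the
    # content described in the text?" and read off the "Yes" token's log-prob.
    # Here we fake P("Yes") as the fraction of caption words audible in the clip.
    words = caption.lower().split()
    p_yes = max(sum(w in audio_tags for w in words) / len(words), 1e-9)
    return math.log(p_yes)

lp = toy_allm_yes_logprob({"dog", "barking"}, "a dog barking")
score = aqa_score(lp)  # higher = better text-audio semantic alignment
```

Working in log-probability space (rather than parsing a free-form generated answer) is what makes the score a calibrated, deterministic quantity that can be compared across audio clips.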
Problem

Research questions and friction points this paper is trying to address.

text-to-audio generation
evaluation metrics
semantic alignment
compositional reasoning
audio question answering
Innovation

Methods, ideas, or system contributions that make the work stand out.

AQAScore
audio question answering
semantic alignment
audio-aware LLMs
text-to-audio evaluation
🔎 Similar Papers
No similar papers found.