🤖 AI Summary
This work addresses the limitations of existing speech generation evaluation methods, which either rely on costly, hard-to-scale subjective human ratings or use automatic metrics that cover only a narrow set of tasks and a single evaluation dimension. To overcome these limitations, the authors propose UniSRM, a unified speech reward model built on the AudioLLM architecture. Using a newly curated dataset (UniSRM-Data) and an accompanying benchmark (UniSRM-Bench), UniSRM follows a two-stage pipeline for reasoning-based, fine-grained assessment, augmented with Reasoning-Consistent Rewards that improve the reliability of the reasoning process. Presented as a framework for fine-grained, interpretable, multi-task, and multi-dimensional speech quality evaluation, UniSRM yields more reliable, human-aligned judgments across diverse speech evaluation tasks, offering a scalable foundation for unified assessment in speech generation.
📝 Abstract
Evaluating speech generation still relies heavily on human judgments, such as Mean Opinion Score (MOS), which are expensive, subjective, and difficult to reproduce at scale. While a few recent studies have begun to explore AudioLLM-based judge models, existing efforts typically target only a narrow set of scenarios (e.g., utterance-level quality or single-turn dialogue) and provide limited coverage of diverse speech generation tasks and evaluation dimensions. In this work, we propose UniSRM, a unified speech reward model that can support multi-dimensional, interpretable reward signals with reliable reasoning. To support training and evaluation, we introduce UniSRM-Data and UniSRM-Bench, covering speech evaluation tasks from utterance-level quality to context-level coherence. Based on this dataset, we present the unified speech reward model, UniSRM, with a two-stage pipeline that enables reasoning-based fine-grained assessment. Furthermore, we introduce Reasoning-Consistent Rewards to improve the reliability of the reasoning process. Experiments show that UniSRM delivers more reliable and human-aligned judgments across a broad range of speech evaluation tasks, offering a practical foundation for scalable and unified evaluation of speech quality.