UniSRM: A Unified Speech Reward Model for Reasoning-Based Fine-grained Assessment

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of existing evaluation methods for speech generation, which either rely on costly, hard-to-scale subjective human ratings or use automatic metrics that cover few tasks and assess a single dimension. The authors propose UniSRM, a unified speech reward model built on an AudioLLM. With a newly curated dataset (UniSRM-Data) and benchmark (UniSRM-Bench) supporting training and evaluation from utterance-level quality to context-level coherence, UniSRM uses a two-stage pipeline for reasoning-based fine-grained assessment, augmented with Reasoning-Consistent Rewards that improve the reliability of the reasoning process. The model provides interpretable, multi-dimensional reward signals, and experiments show more reliable and human-aligned judgments across a broad range of speech evaluation tasks, offering a practical foundation for scalable, unified assessment of speech generation.

📝 Abstract

Evaluating speech generation still relies heavily on human judgments, such as Mean Opinion Score (MOS), which are expensive, subjective, and difficult to reproduce at scale. While a few recent studies have begun to explore AudioLLM-based judge models, existing efforts typically target only a narrow set of scenarios (e.g., utterance-level quality or single-turn dialogue) and provide limited coverage of diverse speech generation tasks and evaluation dimensions. In this work, we propose UniSRM, a unified speech reward model that can support multi-dimensional, interpretable reward signals with reliable reasoning. To support training and evaluation, we introduce UniSRM-Data and UniSRM-Bench, covering speech evaluation tasks from utterance-level quality to context-level coherence. Based on this dataset, we present the unified speech reward model, UniSRM, with a two-stage pipeline that enables reasoning-based fine-grained assessment. Furthermore, we introduce Reasoning-Consistent Rewards to improve the reliability of the reasoning process. Experiments show that UniSRM delivers more reliable and human-aligned judgments across a broad range of speech evaluation tasks, offering a practical foundation for scalable and unified evaluation of speech quality.

Problem

Research questions and friction points this paper is trying to address.

speech evaluation

human judgment

AudioLLM

fine-grained assessment

scalable evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Speech Reward Model

Reasoning-Based Evaluation

Fine-grained Assessment