From Scores to Preferences: Redefining MOS Benchmarking for Speech Quality Reward Modeling

📅 2025-10-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional speech quality assessment relies on subjective Mean Opinion Score (MOS) ratings, which suffer from high annotation cost, inconsistent rating criteria, and poor reproducibility. To address these limitations, this work introduces MOS-RMBench, a unified benchmark that, for the first time, systematically reformulates heterogeneous MOS datasets into pairwise preference-comparison tasks. Building on the benchmark, the authors comparatively evaluate scalar, semi-scalar, and generative reward models (GRMs) within a preference-learning framework, and further propose a MOS-aware GRM whose reward function adapts to MOS differences, sharpening fine-grained discrimination on ambiguous, hard-to-judge pairs. Experiments show that scalar models achieve the highest overall accuracy (above 74%), while the MOS-aware GRM substantially narrows the performance gap on the most challenging samples, establishing a more robust and scalable paradigm for synthetic speech quality assessment.
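To make the benchmark reformulation concrete, below is a minimal sketch of how MOS-rated utterances could be recast as pairwise preference examples. The field names, the exhaustive pairing, and the `min_mos_gap` tie threshold are illustrative assumptions, not the paper's actual construction protocol.

```python
# Hypothetical sketch: recasting a MOS-rated dataset as preference pairs.
# Field names and the tie threshold are assumptions for illustration.
from itertools import combinations

def build_preference_pairs(utterances, min_mos_gap=0.0):
    """utterances: list of dicts like {"audio": path, "mos": float}.
    Returns (preferred, rejected, mos_diff) triples; pairs whose MOS
    difference is at or below min_mos_gap are dropped as ties."""
    pairs = []
    for a, b in combinations(utterances, 2):
        diff = a["mos"] - b["mos"]
        if abs(diff) <= min_mos_gap:
            continue  # too close to call: treat as a tie and skip
        preferred, rejected = (a, b) if diff > 0 else (b, a)
        pairs.append((preferred["audio"], rejected["audio"], abs(diff)))
    return pairs

dataset = [
    {"audio": "utt_001.wav", "mos": 4.2},
    {"audio": "utt_002.wav", "mos": 3.1},
    {"audio": "utt_003.wav", "mos": 3.2},
]
print(build_preference_pairs(dataset))
```

Keeping the MOS difference alongside each pair matters here, because a key finding of the paper is that model accuracy degrades as that difference shrinks.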

📝 Abstract
Assessing the perceptual quality of synthetic speech is crucial for guiding the development and refinement of speech generation models. However, it has traditionally relied on human subjective ratings such as the Mean Opinion Score (MOS), which depend on manual annotations and often suffer from inconsistent rating standards and poor reproducibility. To address these limitations, we introduce MOS-RMBench, a unified benchmark that reformulates diverse MOS datasets into a preference-comparison setting, enabling rigorous evaluation across different datasets. Building on MOS-RMBench, we systematically construct and evaluate three paradigms for reward modeling: scalar reward models, semi-scalar reward models, and generative reward models (GRMs). Our experiments reveal three key findings: (1) scalar models achieve the strongest overall performance, consistently exceeding 74% accuracy; (2) most models perform considerably worse on synthetic speech than on human speech; and (3) all models struggle on pairs with very small MOS differences. To improve performance on these challenging pairs, we propose a MOS-aware GRM that incorporates an MOS-difference-based reward function, enabling the model to adaptively scale rewards according to the difficulty of each sample pair. Experimental results show that the MOS-aware GRM significantly improves fine-grained quality discrimination and narrows the gap with scalar models on the most challenging cases. We hope this work will establish both a benchmark and a methodological framework to foster more rigorous and scalable research in automatic speech quality assessment.
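The abstract does not specify the exact form of the MOS-difference-based reward, so the following is only a hedged sketch of the idea it describes: scale the reward for a preference judgment by pair difficulty, where difficulty grows as the MOS gap shrinks. The exponential weighting and the `alpha` hyperparameter are assumptions made for illustration.

```python
# Illustrative MOS-aware reward: the functional form and `alpha` are
# assumptions; only the adaptive-scaling idea comes from the abstract.
import math

def mos_aware_reward(correct: bool, mos_diff: float, alpha: float = 1.0) -> float:
    """Weight the base reward by pair difficulty: small MOS differences
    mean harder pairs, so judgments on them are rewarded (or penalized)
    more strongly."""
    difficulty = math.exp(-alpha * mos_diff)  # in (0, 1]; near 1 for hard pairs
    base = 1.0 if correct else -1.0
    return base * (1.0 + difficulty)

# A correct call on a near-tie (MOS gap 0.1) pays more than on an easy pair (gap 2.0).
print(mos_aware_reward(True, 0.1), mos_aware_reward(True, 2.0))
```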
Problem

Research questions and friction points this paper is trying to address.

Subjective MOS ratings for speech quality are costly to collect, inconsistently calibrated across annotators, and poorly reproducible
Reward-modeling paradigms for speech quality assessment lack a unified benchmark for rigorous, cross-dataset comparison
Models struggle to discriminate speech pairs with very small MOS differences, especially for synthetic speech
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reformulating heterogeneous MOS datasets into a unified preference-comparison setting
Systematically constructing and evaluating scalar, semi-scalar, and generative reward models (a sketch of the scalar paradigm follows this list)
Proposing a MOS-aware GRM with adaptive, MOS-difference-based reward scaling
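As noted in the list above, here is a minimal sketch of the scalar reward-model paradigm, which the paper reports as strongest overall (above 74% accuracy). The GRU encoder, feature sizes, and mean pooling below are placeholder choices, not the paper's architecture; only the pattern of a single scalar quality score trained with a Bradley-Terry pairwise loss is taken from the text.

```python
# Sketch of a scalar speech-quality reward model under stated assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScalarSpeechRM(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Stand-in for a real speech encoder; consumes 80-dim mel frames.
        self.encoder = nn.GRU(input_size=80, hidden_size=feat_dim, batch_first=True)
        self.head = nn.Linear(feat_dim, 1)  # pooled features -> scalar quality score

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        hidden, _ = self.encoder(mel)         # (B, T, feat_dim)
        pooled = hidden.mean(dim=1)           # temporal mean pooling
        return self.head(pooled).squeeze(-1)  # (B,) scalar scores

def pairwise_loss(score_preferred, score_rejected):
    # Bradley-Terry objective: push the preferred score above the rejected one.
    return -F.logsigmoid(score_preferred - score_rejected).mean()

model = ScalarSpeechRM()
mel_a, mel_b = torch.randn(4, 100, 80), torch.randn(4, 100, 80)  # dummy mel features
loss = pairwise_loss(model(mel_a), model(mel_b))
loss.backward()
```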
👥 Authors
Yifei Cao (Fudan University)
Changhao Jiang (Fudan University)
Jiabao Zhuang (Fudan University)
Jiajun Sun (Fudan University)
Ming Zhang (Fudan University)
Zhiheng Xi (Fudan University) · LLM Reasoning, LLM-based Agents
Hui Li (Fudan University)
Shihan Dou (Fudan University) · LLMs, Code LMs, RL, Alignment
Yuran Wang (Honor Device Co., Ltd)
Yunke Zhang (Honor Device Co., Ltd)
Tao Ji (Renmin University of China)
Tao Gui (Fudan University)
Qi Zhang (Fudan University)
Xuanjing Huang (Fudan University)