AI Summary
This work addresses the limited interpretability of traditional speech quality assessment, which relies heavily on mean opinion scores (MOS) and lacks fine-grained analysis of perceptual dimensions and audio artifacts. The authors propose a post-training framework that first aligns a large audio language model with predefined perceptual dimensions through a calibration phase, followed by a reinforcement learning stage based on Group Relative Policy Optimization (GRPO) with dimension-specific reward mechanisms. This integrated approach enables multi-dimensional reasoning, artifact detection, and temporal localization of degradations. To the best of the authors' knowledge, it is the first method to combine calibration and reinforcement learning for granular speech quality evaluation. Evaluated on the QualiSpeech benchmark, the proposed method achieves an average Pearson correlation coefficient (PCC) of 0.71, improves MOS prediction performance by 13%, and significantly enhances both artifact classification and temporal localization capabilities.
Abstract
Explainable speech quality assessment requires moving beyond Mean Opinion Scores (MOS) to analyze the underlying perceptual dimensions. To address this, we introduce a novel post-training method that tailors a foundational Audio Large Language Model for multidimensional reasoning and for the detection and classification of audio artifacts. First, a calibration stage aligns the model to predict predefined perceptual dimensions. Second, a reinforcement learning stage leverages Group Relative Policy Optimization (GRPO) with dimension-specific rewards to substantially improve the accuracy of quality descriptions and the temporal localization of quality issues. With this approach we reach a state-of-the-art mean PCC of 0.71 on the multidimensional QualiSpeech benchmark and a 13% improvement in MOS prediction driven by RL-based reasoning. Furthermore, our fine-grained GRPO rewards substantially advance the model's ability to pinpoint and classify audio artifacts in time.
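To make the second stage concrete, the sketch below illustrates the core GRPO idea of group-relative advantages combined with a dimension-specific reward. All function names, the reward form (one minus absolute error per dimension, weighted), and the dimension labels are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: dimension-specific rewards + GRPO group-relative
# advantages. Names, reward shape, and dimensions are illustrative only.
import statistics


def dimension_reward(prediction, reference, weights):
    """Weighted sum over perceptual dimensions of 1 - |pred - ref|.

    prediction/reference: dicts mapping dimension name -> score.
    weights: dict mapping dimension name -> importance weight.
    """
    return sum(
        w * (1.0 - abs(prediction[dim] - reference[dim]))
        for dim, w in weights.items()
    )


def grpo_advantages(rewards):
    """GRPO-style advantages: z-normalize rewards within a sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]


# Usage: two sampled responses for one audio clip, scored against references.
reference = {"noisiness": 4.0, "clarity": 3.0}
weights = {"noisiness": 0.5, "clarity": 0.5}
group = [
    {"noisiness": 4.0, "clarity": 3.0},  # matches the reference
    {"noisiness": 2.0, "clarity": 3.0},  # off on one dimension
]
rewards = [dimension_reward(p, reference, weights) for p in group]
advantages = grpo_advantages(rewards)  # better sample gets positive advantage
```

The advantages are centered within each group, so only relative quality among sampled responses drives the policy update, which is what lets per-dimension rewards shape the model without a learned value function.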