AI Summary
This work addresses the limited interpretability of traditional speech quality assessment, which relies heavily on mean opinion scores (MOS) and lacks fine-grained analysis of perceptual dimensions and audio artifacts. The authors propose a post-training framework that first aligns a large audio language model with predefined perceptual dimensions through a calibration phase, followed by a reinforcement learning stage based on Group Relative Policy Optimization (GRPO) with dimension-specific reward mechanisms. This integrated approach enables multi-dimensional reasoning, artifact detection, and temporal localization of degradations. To the best of the authors' knowledge, it is the first method to combine calibration and reinforcement learning for granular speech quality evaluation. Evaluated on the QualiSpeech benchmark, the proposed method achieves an average Pearson correlation coefficient (PCC) of 0.71, improves MOS prediction performance by 13%, and significantly enhances both artifact classification and temporal localization capabilities.
Abstract
Explainable speech quality assessment requires moving beyond Mean Opinion Scores (MOS) to analyze the underlying perceptual dimensions. To address this, we introduce a novel post-training method that tailors a foundational Audio Large Language Model for multidimensional reasoning and for the detection and classification of audio artifacts. First, a calibration stage aligns the model to predict predefined perceptual dimensions. Second, a reinforcement learning stage leverages Group Relative Policy Optimization (GRPO) with dimension-specific rewards to substantially improve the accuracy of quality descriptions and the temporal localization of quality issues. With this approach we reach a state-of-the-art mean PCC of 0.71 on the multidimensional QualiSpeech benchmark and a 13% improvement in MOS prediction driven by RL-based reasoning. Furthermore, our fine-grained GRPO rewards substantially advance the model's ability to pinpoint and classify audio artifacts in time.
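To make the second stage concrete, the sketch below illustrates the core GRPO idea of group-relative advantages combined with a dimension-specific reward. All function names, the reward form (one minus absolute error per dimension, weighted), and the dimension labels are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: dimension-specific rewards + GRPO group-relative
# advantages. Names, reward shape, and dimensions are illustrative only.
import statistics


def dimension_reward(prediction, reference, weights):
    """Weighted sum over perceptual dimensions of 1 - |pred - ref|.

    prediction/reference: dicts mapping dimension name -> score.
    weights: dict mapping dimension name -> importance weight.
    """
    return sum(
        w * (1.0 - abs(prediction[dim] - reference[dim]))
        for dim, w in weights.items()
    )


def grpo_advantages(rewards):
    """GRPO-style advantages: z-normalize rewards within a sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]


# Usage: two sampled responses for one audio clip, scored against references.
reference = {"noisiness": 4.0, "clarity": 3.0}
weights = {"noisiness": 0.5, "clarity": 0.5}
group = [
    {"noisiness": 4.0, "clarity": 3.0},  # matches the reference
    {"noisiness": 2.0, "clarity": 3.0},  # off on one dimension
]
rewards = [dimension_reward(p, reference, weights) for p in group]
advantages = grpo_advantages(rewards)  # better sample gets positive advantage
```

The advantages are centered within each group, so only relative quality among sampled responses drives the policy update, which is what lets per-dimension rewards shape the model without a learned value function.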