🤖 AI Summary
Subjective listening tests remain the bottleneck for evaluating speech quality of neural audio codecs at low bitrates. Method: We systematically benchmark mainstream objective metrics, including PESQ, STOI, and DNSMOS, against human perception using standardized MUSHRA subjective test results, quantifying their agreement with mean opinion scores via Pearson's correlation coefficient. Contribution/Results: Traditional metrics (e.g., PESQ) exhibit markedly degraded performance under neural codec distortions, whereas DNSMOS and novel time-frequency domain metrics achieve superior correlation (r > 0.85). We are the first to characterize the differential sensitivity of objective metrics to neural-specific artifacts, such as spectral smearing and temporal aliasing, and to propose an empirically grounded, optimized metric combination with clearly defined applicability boundaries for neural audio codecs. This work provides evidence-based guidelines for automated, reproducible speech quality assessment in neural codec development and evaluation.
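The correlation analysis described above can be sketched in a few lines of Python. The example below computes Pearson's r between per-condition objective metric scores and MUSHRA mean opinion scores via `scipy.stats.pearsonr`; the data arrays are hypothetical placeholders, not values from the paper.

```python
# Minimal sketch of correlating an objective metric with MUSHRA scores.
# The arrays below are hypothetical example data (one value per codec condition).
import numpy as np
from scipy.stats import pearsonr

metric_scores = np.array([3.1, 2.4, 3.8, 1.9, 3.5])    # e.g., DNSMOS per condition
mushra_mos = np.array([72.0, 55.0, 86.0, 40.0, 80.0])  # MUSHRA mean opinion scores (0-100)

# Pearson's correlation coefficient between objective and subjective scores.
r, p_value = pearsonr(metric_scores, mushra_mos)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
```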
📄 Abstract
Neural audio codecs have recently gained popularity for their use in generative modeling, as they offer high-fidelity audio reconstruction at low bitrates. While human listening studies remain the gold standard for assessing perceptual quality, they are time-consuming and impractical to run at scale. In this work, we examine the reliability of existing objective quality metrics in assessing the performance of recent neural audio codecs. To this end, we conduct a MUSHRA listening test on high-fidelity speech signals and analyze the correlation between subjective scores and widely used objective metrics. Our results show that, while some metrics align well with human perception, others struggle to capture relevant distortions. Our findings provide practical guidance for selecting appropriate evaluation metrics when using neural audio codecs for speech.
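As a concrete starting point, the sketch below scores a codec reconstruction against its clean reference with two of the metrics studied here, using the third-party `pesq` and `pystoi` packages (`pip install pesq pystoi soundfile`). The file names are illustrative assumptions, not part of the paper's evaluation pipeline.

```python
# Sketch: computing PESQ and STOI for a reference/degraded speech pair.
# File names are hypothetical; both signals are assumed mono at the same rate.
import soundfile as sf
from pesq import pesq
from pystoi import stoi

ref, fs = sf.read("reference.wav")  # clean reference speech
deg, _ = sf.read("decoded.wav")     # neural-codec reconstruction

# PESQ wideband mode expects 16 kHz input; STOI accepts the signal's native rate.
print("PESQ (wb):", pesq(fs, ref, deg, "wb"))
print("STOI:", stoi(ref, deg, fs, extended=False))
```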