REFESS-QI: Reference-Free Evaluation for Speech Separation with Joint Quality and Intelligibility Scoring

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of evaluating speech separation in real-world scenarios—where reference audio and textual transcriptions are often unavailable—this paper proposes the first reference-free, text-free self-supervised joint evaluation framework that simultaneously predicts separation quality (SI-SNR) and intelligibility (WER). The method leverages self-supervised representations derived from both the mixture signal and separated tracks, jointly modeling SI-SNR and WER as regression tasks. Evaluated on the WHAMR! dataset, our framework achieves a WER estimation MAE of 17% and a Pearson correlation coefficient (PCC) of 0.77; for SI-SNR, it attains an MAE of 1.38 and a PCC of 0.95—substantially outperforming all baselines. Extensive experiments demonstrate strong robustness and high metric correlation, establishing a novel paradigm for unsupervised speech separation evaluation.
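The summary reports results in terms of mean absolute error (MAE) and Pearson correlation coefficient (PCC) between predicted and true metric values. For reference, both can be computed in a few lines; this is a minimal NumPy sketch (the function names are our own, not the paper's):

```python
import numpy as np

def mae(pred, true):
    """Mean absolute error between predicted and true metric values."""
    pred, true = np.asarray(pred, dtype=float), np.asarray(true, dtype=float)
    return float(np.mean(np.abs(pred - true)))

def pcc(pred, true):
    """Pearson correlation coefficient between predictions and targets."""
    return float(np.corrcoef(pred, true)[0, 1])
```

MAE measures absolute estimation accuracy (e.g. "WER MAE of 17%"), while PCC measures how well the estimator preserves the ranking and linear trend of the true metric.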

📝 Abstract
Source separation is a crucial pre-processing step for various speech processing tasks, such as automatic speech recognition (ASR). Traditionally, evaluation metrics for speech separation rely on matched reference audio and corresponding transcriptions to assess audio quality and intelligibility. However, they cannot be used to evaluate real-world mixtures for which no reference exists. This paper introduces a text-free, reference-free evaluation framework based on self-supervised learning (SSL) representations. The proposed framework uses the mixture and separated tracks to jointly predict audio quality, through the Scale-Invariant Signal-to-Noise Ratio (SI-SNR) metric, and speech intelligibility, through the Word Error Rate (WER) metric. Experiments on the WHAMR! dataset yield WER estimation with a mean absolute error (MAE) of 17% and a Pearson correlation coefficient (PCC) of 0.77, and SI-SNR estimation with an MAE of 1.38 and a PCC of 0.95. We further demonstrate the robustness of our estimator by using various SSL representations.
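The quality target the framework regresses, SI-SNR, has a standard closed form: zero-mean both signals, project the estimate onto the reference to obtain a scale-invariant target, and take the log power ratio of target to residual. A minimal sketch (not the paper's code):

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-Invariant Signal-to-Noise Ratio in dB between an estimate and a reference."""
    est = np.asarray(est, dtype=float)
    ref = np.asarray(ref, dtype=float)
    # Remove DC offset so the metric ignores constant shifts
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference: the scale-invariant target
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10.0 * np.log10((np.dot(s_target, s_target) + eps)
                           / (np.dot(e_noise, e_noise) + eps))
```

Because the target is rescaled by the projection, multiplying the estimate by any nonzero constant leaves the score unchanged, which is what makes the metric "scale-invariant".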
Problem

Research questions and friction points this paper is trying to address.

Evaluating speech separation without reference audio or transcriptions
Predicting audio quality and intelligibility jointly using SSL representations
Assessing real-world mixtures where no ground truth exists
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reference-free evaluation using self-supervised learning representations
Jointly predicts audio quality and speech intelligibility metrics
Leverages mixture and separated tracks without reference signals
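The bullets above describe pooling SSL embeddings of the mixture and each separated track and regressing both targets jointly. As an interface illustration only, here is a toy, untrained sketch; the class, its linear heads, and the sigmoid bound on WER are all our assumptions, since the paper's actual architecture is not specified here:

```python
import numpy as np

class JointQIHead:
    """Hypothetical two-head regressor over concatenated [mixture; separated] SSL embeddings."""

    def __init__(self, feat_dim, seed=0):
        rng = np.random.default_rng(seed)
        # One weight vector per target, over the concatenated embedding (illustrative, untrained)
        self.w_sisnr = rng.standard_normal(2 * feat_dim) * 0.01
        self.w_wer = rng.standard_normal(2 * feat_dim) * 0.01

    def predict(self, mix_frames, sep_frames):
        """mix_frames, sep_frames: (T, feat_dim) frame-level SSL features."""
        # Mean-pool frames to utterance level, then concatenate mixture and separated views
        x = np.concatenate([mix_frames.mean(axis=0), sep_frames.mean(axis=0)])
        sisnr_hat = float(x @ self.w_sisnr)                 # unbounded regression target (dB)
        wer_hat = 1.0 / (1.0 + np.exp(-(x @ self.w_wer)))   # WER squashed to [0, 1]
        return sisnr_hat, wer_hat
```

The point of the sketch is the input contract: both the mixture and the separated track feed the estimator, so the model can judge a separation relative to what it started from, without ever seeing a reference signal.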