🤖 AI Summary
To address the challenge of evaluating speech separation in real-world scenarios—where reference audio and textual transcriptions are often unavailable—this paper proposes the first reference-free, text-free self-supervised joint evaluation framework that simultaneously predicts separation quality (SI-SNR) and intelligibility (WER). The method leverages self-supervised representations derived from both the mixture signal and the separated tracks, jointly modeling SI-SNR and WER as regression tasks. Evaluated on the WHAMR! dataset, the framework achieves a WER estimation MAE of 17% and a Pearson correlation coefficient (PCC) of 0.77; for SI-SNR, it attains an MAE of 1.38 and a PCC of 0.95, substantially outperforming all baselines. Extensive experiments demonstrate strong robustness and high metric correlation, establishing a new paradigm for unsupervised speech separation evaluation.
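The framework's core idea, as summarized above, is to map SSL embeddings of the mixture and a separated track to two regression targets (SI-SNR and WER). A minimal sketch of that idea is below; it is not the paper's actual architecture. The embedding dimension `D`, hidden width `H`, mean pooling, and the two-layer head are all illustrative assumptions, and the weights here are random where the real model would be trained against ground-truth SI-SNR and WER.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 768  # SSL frame embedding dimension (assumption, e.g. a WavLM-sized model)
H = 256  # hidden width of the regression head (assumption)

def pool(frames):
    """Mean-pool a (T, D) sequence of SSL frames into a single (D,) vector."""
    return frames.mean(axis=0)

class JointEstimator:
    """Tiny MLP mapping [mixture ; separated] embeddings to (SI-SNR, WER).

    A hedged sketch only: random, untrained weights standing in for a
    head that would be fit by regression on labelled separation outputs.
    """
    def __init__(self):
        self.w1 = rng.standard_normal((2 * D, H)) * 0.02
        self.b1 = np.zeros(H)
        self.w2 = rng.standard_normal((H, 2)) * 0.02
        self.b2 = np.zeros(2)

    def __call__(self, mixture_frames, separated_frames):
        # Concatenate pooled mixture and separated-track embeddings,
        # then regress both metrics jointly from the shared hidden layer.
        x = np.concatenate([pool(mixture_frames), pool(separated_frames)])
        h = np.tanh(x @ self.w1 + self.b1)
        si_snr_hat, wer_hat = h @ self.w2 + self.b2
        return si_snr_hat, wer_hat
```

Joint prediction from a shared hidden layer mirrors the paper's multi-task framing: quality and intelligibility are correlated, so one shared representation can serve both regression heads.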
📝 Abstract
Source separation is a crucial pre-processing step for various speech processing tasks, such as automatic speech recognition (ASR). Traditionally, evaluation metrics for speech separation rely on matched reference audio and corresponding transcriptions to assess audio quality and intelligibility. However, they cannot be used to evaluate real-world mixtures for which no reference exists. This paper introduces a text-free, reference-free evaluation framework based on self-supervised learning (SSL) representations. The proposed framework utilizes the mixture and separated tracks to jointly predict audio quality, via the Scale-Invariant Signal-to-Noise Ratio (SI-SNR) metric, and speech intelligibility, via the Word Error Rate (WER) metric. Experiments on the WHAMR! dataset show WER estimation with a mean absolute error (MAE) of 17% and a Pearson correlation coefficient (PCC) of 0.77, and SI-SNR estimation with an MAE of 1.38 and a PCC of 0.95. We further demonstrate the robustness of our estimator across various SSL representations.
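For reference, the SI-SNR target the abstract names is the standard scale-invariant metric: both signals are zero-meaned, the estimate is projected onto the reference, and the ratio of projection power to residual power is reported in dB. A minimal NumPy implementation (the `eps` guard is an implementation convenience, not part of the paper):

```python
import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    """Scale-Invariant Signal-to-Noise Ratio in dB.

    Zero-means both signals, projects the estimate onto the reference,
    and compares projection power to residual power. Rescaling the
    estimate by any nonzero factor leaves the value unchanged.
    """
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Orthogonal projection of the (zero-mean) estimate onto the reference.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    s_target = scale * reference
    e_noise = estimate - s_target
    return 10 * np.log10(
        (np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps)
    )

# Example: a clean tone as reference, a noisy "separated" estimate.
t = np.linspace(0.0, 1.0, 8000)
ref = np.sin(2 * np.pi * 440 * t)
est = ref + 0.1 * np.random.default_rng(0).standard_normal(8000)
```

The scale invariance is what makes the metric suitable as a regression target here: a predictor need not recover the absolute gain of the separated track, only its fidelity to the source.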