MAPSS: Manifold-based Assessment of Perceptual Source Separation

📅 2025-09-11
🤖 AI Summary
Objective metrics for source separation often disagree with human perceptual judgments, particularly when leakage and self-distortion interact, and existing metrics do not separate the two factors. This work proposes two differentiable perceptual measures, Perceptual Separation (PS) and Perceptual Match (PM), described as the first pair to functionally isolate leakage from self-distortion. Both encode the reference waveforms, a bank of distortions generated from them, and the system outputs with a pretrained self-supervised model, project the embeddings onto a manifold via diffusion maps, and compute Mahalanobis distances on that manifold: PM uses the distance from each output to its attributed cluster, while PS uses the distances to the attributed and the closest non-attributed clusters. For robustness, the authors derive a deterministic error radius and non-asymptotic, high-probability confidence intervals for both measures. On English, Spanish, and music mixtures, PS and PM nearly always correlate more strongly with human mean-opinion scores than 14 competing measures, with linear correlation coefficients reaching 86.36% for speech and 87.21% for music; at worst, the error radius is 1.39% and the 95% CI is 12.21%.

📝 Abstract
Objective assessment of source-separation systems still mismatches subjective human perception, especially when leakage and self-distortion interact. We introduce Perceptual Separation (PS) and Perceptual Match (PM), the first pair of measures that functionally isolate these two factors. Our intrusive method begins by generating a bank of fundamental distortions for each reference waveform signal in the mixture. Distortions, references, and their respective system outputs from all sources are then independently encoded by a pre-trained self-supervised learning model. These representations are aggregated and projected onto a manifold via diffusion maps, which align Euclidean distances on the manifold with dissimilarities of the encoded waveforms. On this manifold, the PM measures the Mahalanobis distance from each output to its attributed cluster, which consists of its reference and distortion embeddings, capturing self-distortion. The PS accounts for the Mahalanobis distances from the output to the attributed cluster and to the closest non-attributed cluster, quantifying leakage. Both measures are differentiable and granular, operating at a resolution as low as 50 frames per second. We further derive, for both measures, a deterministic error radius and non-asymptotic, high-probability confidence intervals (CIs). Experiments on English, Spanish, and music mixtures show that the PS and PM nearly always achieve higher linear correlation coefficients with human mean-opinion scores than 14 competitors, reaching as high as 86.36% for speech and 87.21% for music. We observe, at worst, an error radius of 1.39% and a probabilistic 95% CI of 12.21% for these coefficients, which supports reliable and informed evaluation. A mutual-information analysis shows that the measures complement each other most as their values decrease, suggesting they are jointly more informative as system performance degrades.
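To make the manifold-projection step more concrete, here is a minimal NumPy sketch of a textbook diffusion map. It is not the authors' implementation: the Gaussian kernel, the median bandwidth heuristic, the function name `diffusion_map`, and the toy data standing in for encoder embeddings are all assumptions made for illustration.

```python
import numpy as np

def diffusion_map(X, n_components=2, epsilon=None, t=1):
    """Embed the rows of X with a basic diffusion map (illustrative sketch)."""
    # Pairwise squared Euclidean distances between embeddings.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    if epsilon is None:
        # Assumed heuristic: median of the nonzero squared distances.
        epsilon = np.median(d2[d2 > 0])
    K = np.exp(-d2 / epsilon)            # Gaussian affinity matrix
    deg = K.sum(axis=1)                  # node degrees
    # Symmetric conjugate of the Markov matrix P = D^{-1} K.
    S = K / np.sqrt(np.outer(deg, deg))
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]       # sort eigenpairs, largest first
    vals, vecs = vals[order], vecs[:, order]
    # Right eigenvectors of P are D^{-1/2} times the eigenvectors of S.
    psi = vecs / np.sqrt(deg)[:, None]
    # Drop the trivial first pair (eigenvalue 1, constant vector).
    return psi[:, 1:n_components + 1] * vals[1:n_components + 1] ** t

# Toy stand-in for encoder outputs: two well-separated groups of 20 points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 8)), rng.normal(5.0, 0.3, (20, 8))])
emb = diffusion_map(X, n_components=2)
```

In this toy setting the first diffusion coordinate `emb[:, 0]` separates the two groups, so Euclidean distances in the embedding reflect how dissimilar the original points are.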
Problem

Research questions and friction points this paper is trying to address.

Reducing the mismatch between objective source-separation measures and human perception
Isolating leakage and self-distortion factors in separation
Developing differentiable measures correlating with human perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pre-trained self-supervised model encodes waveform representations
Diffusion maps project embeddings onto perceptual manifold
Mahalanobis distances measure self-distortion and leakage separately
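
The distance computations listed above can be sketched as follows. This is only an illustration under stated assumptions: the helper names, the ridge regularization, and the toy 2-D clusters are hypothetical, and the code returns raw Mahalanobis distances rather than the paper's exact PS and PM formulas, which are not given here.

```python
import numpy as np

def mahalanobis_to_cluster(x, cluster, ridge=1e-6):
    """Mahalanobis distance from point x to the distribution of a cluster."""
    mu = cluster.mean(axis=0)
    cov = np.atleast_2d(np.cov(cluster, rowvar=False))
    cov = cov + ridge * np.eye(cov.shape[0])  # assumed ridge for stability
    diff = x - mu
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

def self_and_nearest_other(x, clusters, attributed):
    """Distance to the attributed cluster and to the closest other cluster."""
    dists = {name: mahalanobis_to_cluster(x, pts) for name, pts in clusters.items()}
    d_self = dists[attributed]
    d_other = min(d for name, d in dists.items() if name != attributed)
    return d_self, d_other

# Toy clusters standing in for reference-plus-distortion embeddings.
rng = np.random.default_rng(0)
clusters = {
    "source_a": rng.normal(0.0, 1.0, (50, 2)),
    "source_b": rng.normal(6.0, 1.0, (50, 2)),
}
output = np.array([0.5, -0.2])  # hypothetical embedding of a separated output
d_self, d_other = self_and_nearest_other(output, clusters, "source_a")
```

A small `d_self` suggests little self-distortion, in the spirit of PM, while a `d_other` that is large relative to `d_self` suggests little leakage, in the spirit of PS.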