VM-UNSSOR: Unsupervised Neural Speech Separation Enhanced by Higher-SNR Virtual Microphone Arrays

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
In unsupervised neural speech separation (UNSSOR), reducing the number of microphones weakens the mixture-consistency (MC) constraint, causing separation performance to degrade sharply. Method: This paper proposes a higher-SNR virtual-microphone-array approach: virtual channels are generated by applying linear spatial demixers, such as independent vector analysis (IVA) or spatial clustering, to the observed mixtures, and the augmented signals are then used in end-to-end unsupervised DNN training with extra MC losses. Contribution/Results: This work introduces, for the first time, a virtual-microphone signal-enhancement mechanism that simultaneously alleviates frequency-wise permutation ambiguity and significantly improves robustness when few physical microphones are available. On the SMS-WSJ dataset, the method achieves 17.1 dB SI-SDR in the six-microphone, two-speaker scenario, outperforming UNSSOR by 2.4 dB, and dramatically improves the two-microphone setting from −2.7 dB to 10.7 dB, demonstrating both effectiveness and generalizability.

📝 Abstract
Blind speech separation (BSS) aims to recover multiple speech sources from multi-channel, multi-speaker mixtures under unknown array geometry and room impulse responses. In the unsupervised setup, where clean target speech is not available for model training, UNSSOR proposes a mixture-consistency (MC) loss for training deep neural networks (DNNs) on over-determined training mixtures to realize unsupervised speech separation. However, when the number of microphones in the training mixtures decreases, the MC constraint weakens and separation performance falls dramatically. To address this, we propose VM-UNSSOR, which augments the observed training mixtures, recorded by a limited number of microphones, with several higher-SNR virtual-microphone (VM) signals obtained by applying linear spatial demixers (such as IVA and spatial clustering) to the observed training mixtures. As linear projections of the observed mixtures, the virtual-microphone signals can typically increase the SNR of each source and can be leveraged to compute extra MC losses, improving UNSSOR and addressing its frequency permutation problem. On the SMS-WSJ dataset, in the over-determined six-microphone, two-speaker separation setup, VM-UNSSOR reaches 17.1 dB SI-SDR while UNSSOR obtains only 14.7 dB; in the determined two-microphone, two-speaker case, UNSSOR collapses to −2.7 dB SI-SDR while VM-UNSSOR achieves 10.7 dB.
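The mixture-consistency idea above can be illustrated with a minimal sketch: at each microphone, the per-source estimates, mapped back onto that microphone by a linear filter, should sum to the observed mixture. Everything here is a simplification for illustration (a single complex gain per mic-source pair stands in for the paper's learned per-frequency filters); the names are not from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
num_mics, num_srcs, num_frames, num_bins = 2, 2, 50, 65

# Complex STFT-like source signals: (sources, frames, bins)
sources = rng.standard_normal((num_srcs, num_frames, num_bins)) \
    + 1j * rng.standard_normal((num_srcs, num_frames, num_bins))

# One complex gain per (mic, source) pair stands in for the room transfer
gains = rng.standard_normal((num_mics, num_srcs)) \
    + 1j * rng.standard_normal((num_mics, num_srcs))

# Observed mixtures at each mic: sum of per-source images
mixtures = np.einsum('ps,stf->ptf', gains, sources)

def mc_loss(estimates, mixtures, gains):
    """L1 mixture-consistency loss: how far the summed, re-projected
    source estimates are from the observed mixture at each mic."""
    recon = np.einsum('ps,stf->ptf', gains, estimates)
    return np.abs(mixtures - recon).mean()

# With the true sources, reconstruction is exact and the loss is zero;
# any other estimate leaves a residual, which is the training signal.
print(mc_loss(sources, mixtures, gains))  # → 0.0
```

In UNSSOR the projection filters are themselves estimated (via forward convolutive prediction) rather than known, and VM-UNSSOR adds the same style of loss term on the virtual channels.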
Problem

Research questions and friction points this paper is trying to address.

Enhancing unsupervised speech separation with virtual microphone arrays
Addressing performance degradation with limited physical microphones
Solving frequency permutation problem in blind source separation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhances unsupervised speech separation with virtual microphones
Uses linear spatial demixers to create higher-SNR signals
Adds mixture consistency losses from virtual microphone arrays
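The channel-augmentation step in the bullets above amounts to stacking linear projections of the observed mixtures as extra input channels. A minimal sketch, assuming a random invertible demixing matrix in place of the IVA or spatial-clustering demixer the paper actually uses (which would also be applied per frequency bin rather than globally):

```python
import numpy as np

rng = np.random.default_rng(1)
num_mics, num_frames, num_bins = 2, 50, 65

# Observed physical-microphone signals: (mics, frames, bins)
mixtures = rng.standard_normal((num_mics, num_frames, num_bins))

# Stand-in demixing matrix; a real system would estimate this with IVA
# or spatial clustering so each virtual channel has higher per-source SNR
W = rng.standard_normal((num_mics, num_mics))

# Virtual-microphone signals are linear projections of the mixtures
virtual = np.einsum('vp,ptf->vtf', W, mixtures)

# Train on physical + virtual channels, with MC losses on both groups
augmented = np.concatenate([mixtures, virtual], axis=0)
print(augmented.shape)  # → (4, 50, 65)
```

Because the virtual channels are deterministic linear functions of the observations, they add MC constraints without requiring any extra hardware or clean references.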