🤖 AI Summary
Current speech separation research suffers from methodological fragmentation and a lack of systematic, standardized evaluation. To address this, we present the first comprehensive survey and empirical analysis of deep neural network–based speech separation techniques. We propose a unified modeling framework that covers known- and unknown-speaker scenarios, learning paradigms from supervised to self-supervised, and the encoder–separator–decoder architectural components. Under controlled experimental conditions, we benchmark over 30 state-of-the-art models on standard datasets, characterizing their performance ceilings and robustness limitations. Based on these findings, we identify four key frontiers: domain-adaptive robustness, lightweight and efficient architectures, audio-visual multimodal integration, and novel self-supervised paradigms that leverage mask-based reconstruction and contrastive learning. This work fills a critical gap in systematic benchmarking and delivers a reproducible, principle-driven technical roadmap, moving speech separation from ad hoc model aggregation toward theoretically grounded progress.
📝 Abstract
The field of speech separation, which addresses the "cocktail party problem", has seen revolutionary advances with deep neural networks (DNNs). Speech separation enhances clarity in complex acoustic environments and serves as a crucial pre-processing step for speech recognition and speaker recognition. However, the current literature focuses narrowly on specific architectures or isolated approaches, which leaves the field with a fragmented understanding. This survey addresses this gap by providing a systematic examination of DNN-based speech separation techniques. Our work differentiates itself through: (I) Comprehensive perspective: We systematically investigate learning paradigms, separation scenarios with known and unknown speakers, supervised, self-supervised, and unsupervised frameworks in comparison, and architectural components from encoders to estimation strategies. (II) Timeliness: We cover cutting-edge developments, giving readers access to current innovations and benchmarks. (III) Unique insights: Beyond summarization, we evaluate technological trajectories, identify emerging patterns, and highlight promising directions, including domain-robust frameworks, efficient architectures, multimodal integration, and novel self-supervised paradigms. (IV) Fair evaluation: We provide quantitative evaluations on standard datasets, revealing the true capabilities and limitations of different methods. This comprehensive survey serves as an accessible reference for both experienced researchers and newcomers navigating the complex landscape of speech separation.