Unleashing the Power of Natural Audio Featuring Multiple Sound Sources

📅 2025-04-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current sound separation methods rely heavily on synthetically mixed training data, which limits their generalization to real-world, naturally mixed acoustic scenes. To address this, we propose ClearSep, a framework in which remix-based consistency metrics drive the joint optimization of separation and self-supervised distillation. ClearSep introduces an iterative data engine that combines self-supervised knowledge distillation, dynamic pseudo-label generation, and time-frequency masking-based separation; by enforcing remix consistency with adaptive thresholds, it enables tailored training on individual source tracks. Evaluated across multiple benchmarks, ClearSep achieves state-of-the-art performance, significantly improving separation quality, robustness, and cross-domain generalization on naturally mixed audio, thereby overcoming the generalization bottleneck imposed by artificial mixing.

📝 Abstract
Universal sound separation aims to extract clean audio tracks corresponding to distinct events from mixed audio, which is critical for artificial auditory perception. However, current methods heavily rely on artificially mixed audio for training, which limits their ability to generalize to naturally mixed audio collected in real-world environments. To overcome this limitation, we propose ClearSep, an innovative framework that employs a data engine to decompose complex naturally mixed audio into multiple independent tracks, thereby allowing effective sound separation in real-world scenarios. We introduce two remix-based evaluation metrics to quantitatively assess separation quality and use these metrics as thresholds to iteratively apply the data engine alongside model training, progressively optimizing separation performance. In addition, we propose a series of training strategies tailored to these separated independent tracks to make the best use of them. Extensive experiments demonstrate that ClearSep achieves state-of-the-art performance across multiple sound separation tasks, highlighting its potential for advancing sound separation in natural audio scenarios. For more examples and detailed results, please visit our demo page at https://clearsep.github.io.
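The abstract's remix-based evaluation can be sketched concretely: sum the separated tracks back into a "remix" and score it against the original mixture, accepting the tracks as training material only when the remix is faithful. The sketch below is an illustrative assumption, not the authors' code: it uses SI-SDR as the consistency score and a made-up 20 dB threshold, and it mocks "good" and "bad" separations with toy signals instead of a real separator.

```python
import numpy as np

def si_sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Scale-invariant signal-to-distortion ratio (dB) of estimate vs. reference."""
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference          # scaled projection onto the reference
    noise = estimate - target           # residual distortion
    return float(10.0 * np.log10(np.sum(target ** 2) / (np.sum(noise ** 2) + 1e-12)))

def remix_consistency(mixture: np.ndarray, tracks: list) -> float:
    """Sum the separated tracks into a remix and score it against the mixture."""
    remix = np.sum(tracks, axis=0)
    return si_sdr(mixture, remix)

def accept_tracks(mixture: np.ndarray, tracks: list, threshold_db: float = 20.0) -> bool:
    """Keep separated tracks as pseudo-labels only if the remix is faithful.
    The 20 dB threshold is an illustrative value, not the paper's setting."""
    return remix_consistency(mixture, tracks) >= threshold_db

# Toy demonstration: a near-perfect separation passes, a leaky one fails.
rng = np.random.default_rng(0)
a = rng.standard_normal(16000)
b = rng.standard_normal(16000)
mixture = a + b
good = [a + 1e-4 * rng.standard_normal(16000), b]   # near-perfect tracks
bad = [0.5 * a, b]                                  # loses half of source a
print(accept_tracks(mixture, good), accept_tracks(mixture, bad))
```

In the paper's iterative scheme, a score like this would gate which decomposed tracks from naturally mixed audio feed back into training, with the data engine re-applied as the model improves.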
Problem

Research questions and friction points this paper is trying to address.

Universal sound separation from mixed audio
Generalization to real-world natural audio
Quantitative evaluation of separation quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

ClearSep framework for natural audio separation
Data engine decomposes complex mixed audio
Remix-based metrics optimize separation performance
👥 Authors
Xize Cheng
Zhejiang University, Hangzhou, China
Slytherin Wang
Independent Researcher
Zehan Wang
Zhejiang University, Hangzhou, China
Rongjie Huang
FAIR, Zhejiang University
Multimedia Computing, Speech, Natural Language Processing
Tao Jin
Zhejiang University, Hangzhou, China
Zhou Zhao
Zhejiang University
Machine Learning, Data Mining, Multimedia Computing