🤖 AI Summary
This work addresses the prevalence of multi-source acoustic mixtures in existing audio event datasets such as FSD50K, where background interference and overlapping events can limit the usefulness of the data for single-source modeling. To mitigate this limitation, we propose an automated data-curation framework that combines a generative diffusion model with a discriminative classifier. The diffusion model synthesizes clean, single-class audio events, which are combined into controlled noisy mixtures for supervision; a pretrained audio encoder coupled with a classifier is then used to identify and filter out multi-source samples from the original dataset. The framework achieves strong performance on a human expert-curated test set and yields FSD50K-Solo, a model-curated subset of FSD50K containing the samples it identifies as single-source. Beyond FSD50K, the method offers a scalable paradigm for curating open-source audio corpora.
📝 Abstract
High-quality training datasets are essential for the performance of neural networks. However, the audio domain still lacks a large-scale, strongly labeled, single-source sound event dataset. The FSD50K dataset, despite being relatively large and open, contains a considerable fraction of multi-source samples in which background interference or overlapping events could limit the usefulness of the data. To address this challenge, we introduce a data curation framework designed for large-scale open audio corpora. Our approach leverages a generative diffusion model to synthesize clean single-class events, from which we construct controlled noisy mixtures for supervision. We then employ a pre-trained audio encoder coupled with a discriminative classifier to automatically identify and filter out multi-source samples. Experiments show that our framework achieves strong performance on a human expert-curated test set. Finally, we release FSD50K-Solo, a model-curated subset of FSD50K containing single-source audio samples identified by our method. Beyond FSD50K, our method establishes a scalable paradigm for curating open-source audio corpora.
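The supervision step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (mix_at_snr, build_supervision, keep_single_source), the SNR range, and the 0.5 threshold are all hypothetical choices made for illustration, and the diffusion model, audio encoder, and classifier are left out entirely. The sketch only shows how clean single-class clips might be paired with interferers to produce labeled single-source and multi-source examples, and how classifier scores might then be thresholded to filter a corpus.

```python
import numpy as np

def mix_at_snr(target, interferer, snr_db):
    """Add `interferer` to `target`, scaled so that the
    target-to-interferer power ratio equals `snr_db` decibels."""
    n = min(len(target), len(interferer))
    target, interferer = target[:n], interferer[:n]
    p_target = np.mean(target ** 2)
    p_interf = np.mean(interferer ** 2)
    if p_interf == 0:
        return target.copy()  # silent interferer: nothing to mix in
    gain = np.sqrt(p_target / (p_interf * 10 ** (snr_db / 10)))
    return target + gain * interferer

def build_supervision(clean_clips, rng, snr_range=(0.0, 20.0)):
    """Return (clips, labels) with label 0 = single-source, 1 = mixture.

    Assumption: each clean clip is paired with one other randomly chosen
    clip at a random SNR; the paper's actual mixing recipe may differ."""
    clips, labels = [], []
    for i, clip in enumerate(clean_clips):
        clips.append(clip)
        labels.append(0)
        others = [k for k in range(len(clean_clips)) if k != i]
        j = rng.choice(others)
        snr_db = rng.uniform(*snr_range)
        clips.append(mix_at_snr(clip, clean_clips[j], snr_db))
        labels.append(1)
    return clips, np.array(labels)

def keep_single_source(p_multi, threshold=0.5):
    """Indices of samples whose predicted multi-source probability
    (from some classifier, not shown here) is below `threshold`."""
    return np.flatnonzero(np.asarray(p_multi) < threshold)
```

In such a setup, the clips returned by build_supervision would be passed through a pretrained audio encoder, a classifier would be trained on the resulting embeddings and labels, and keep_single_source would be applied to the classifier's scores on the original dataset.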