🤖 AI Summary
Audio datasets commonly suffer from off-topic samples, near-duplicates, and label noise—issues that severely degrade model performance. To address this, we propose the first unified data quality auditing framework tailored for audio, adapting SelfClean from computer vision to the audio domain. Leveraging self-supervised pretrained models (e.g., AST, BEATs), our method extracts robust audio representations and establishes a “representation → ranking” paradigm, enabling simultaneous detection and interpretable prioritization of all three data quality issues in a single pipeline. Evaluated on ESC-50, GTZAN, and an industrial private dataset, our approach significantly outperforms task-specific baselines across multiple ranking metrics (average +12.7% NDCG@10). Human review efficiency improves by 3.2×, substantially reducing annotation costs. The framework offers a scalable, plug-and-play solution for audio data governance.
📝 Abstract
Data quality issues such as off-topic samples, near duplicates, and label errors often limit the performance of audio-based systems. This paper addresses these issues by adapting SelfClean, a representation-to-rank data auditing framework, from the image to the audio domain. This approach leverages self-supervised audio representations to identify common data quality issues, creating ranked review lists that surface distinct issues within a single, unified process. The method is benchmarked on the ESC-50, GTZAN, and a proprietary industrial dataset, using both synthetic and naturally occurring corruptions. The results demonstrate that this framework achieves state-of-the-art ranking performance, often outperforming issue-specific baselines and enabling significant annotation savings by efficiently guiding human review.