Representation-Based Data Quality Audits for Audio

📅 2025-09-30
🤖 AI Summary
Audio datasets commonly suffer from off-topic samples, near duplicates, and label noise, all of which severely degrade model performance. To address this, we propose the first unified data quality auditing framework tailored for audio, adapting SelfClean from computer vision to the audio domain. Using self-supervised pretrained models (e.g., AST, BEATs), the method extracts robust audio representations and follows a “representation → ranking” paradigm: all three data quality issues are detected and prioritized for review in a single, interpretable pipeline. Evaluated on ESC-50, GTZAN, and a private industrial dataset, the approach significantly outperforms task-specific baselines across multiple ranking metrics (average +12.7% NDCG@10). Human review efficiency improves by 3.2×, substantially reducing annotation costs. The framework offers a scalable, plug-and-play solution for audio data governance.

📝 Abstract
Data quality issues such as off-topic samples, near duplicates, and label errors often limit the performance of audio-based systems. This paper addresses these issues by adapting SelfClean, a representation-to-rank data auditing framework, from the image domain to the audio domain. The approach uses self-supervised audio representations to identify common data quality issues and produces ranked review lists that surface each issue type within a single, unified process. The method is benchmarked on ESC-50, GTZAN, and a proprietary industrial dataset, using both synthetic and naturally occurring corruptions. The results demonstrate that the framework achieves state-of-the-art ranking performance, often outperforming issue-specific baselines, and enables significant annotation savings by efficiently guiding human review.

Problem

Research questions and friction points this paper is trying to address.

Detecting off-topic samples and near duplicates
Identifying label errors in audio datasets
Unifying data quality audits using audio representations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapts SelfClean framework to audio domain
Uses self-supervised representations for data auditing
Creates unified ranking system for multiple issues
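
To make the "representation → ranking" idea concrete, the sketch below shows one way that precomputed audio embeddings could be turned into three ranked review lists. It is a simplified illustration, not the scoring used by SelfClean or by the paper: the function names are hypothetical, the off-topic score is a plain k-nearest-neighbour mean distance, and the label-error score is a nearest-neighbour distance ratio. It assumes the embeddings have already been extracted by some self-supervised audio encoder, that there are at least two classes, and that every class has at least two samples.

```python
import numpy as np


def cosine_distance_matrix(emb):
    """Pairwise cosine distances between the rows of an (n, d) embedding array."""
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    z = emb / np.clip(norms, 1e-12, None)
    return 1.0 - z @ z.T


def rank_near_duplicates(dist):
    """Return index pairs (i, j), i < j, sorted from closest to farthest."""
    i, j = np.triu_indices(dist.shape[0], k=1)
    order = np.argsort(dist[i, j], kind="stable")
    return [(int(a), int(b)) for a, b in zip(i[order], j[order])]


def rank_off_topic(dist, k=3):
    """Rank samples by mean distance to their k nearest neighbours (assumed heuristic).

    Samples far from everything else come first.
    """
    d = dist.copy()
    np.fill_diagonal(d, np.inf)  # ignore self-distances
    knn_mean = np.sort(d, axis=1)[:, :k].mean(axis=1)
    return np.argsort(-knn_mean, kind="stable")


def rank_label_errors(dist, labels):
    """Rank samples by how much closer they sit to another class than to their own.

    Assumed heuristic: score = d_same / (d_same + d_other), where d_same and
    d_other are distances to the nearest same-label and other-label sample.
    Scores near 1 suggest a possible label error.
    """
    labels = np.asarray(labels)
    d = dist.copy()
    np.fill_diagonal(d, np.inf)  # a sample is not its own neighbour
    same = labels[:, None] == labels[None, :]
    d_same = np.where(same, d, np.inf).min(axis=1)
    d_other = np.where(~same, d, np.inf).min(axis=1)
    score = d_same / (d_same + d_other + 1e-12)
    return np.argsort(-score, kind="stable")
```

In use, a single distance matrix computed once from the embeddings feeds all three functions, and a reviewer would inspect only the top of each list. This mirrors the unified, ranking-based workflow described above, while the specific scores remain placeholders for whatever the audited framework actually computes.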