Representation-Based Data Quality Audits for Audio

📅 2025-09-30

📈 Citations: 0

✨ Influential: 0

career value

168K/year

🤖 AI Summary

Audio datasets commonly suffer from off-topic samples, near-duplicates, and label noise—issues that severely degrade model performance. To address this, we propose the first unified data quality auditing framework tailored for audio, adapting SelfClean from computer vision to the audio domain. Leveraging self-supervised pretrained models (e.g., AST, BEATs), our method extracts robust audio representations and establishes a “representation → ranking” paradigm, enabling simultaneous detection and interpretable prioritization of all three data quality issues in a single pipeline. Evaluated on ESC-50, GTZAN, and an industrial private dataset, our approach significantly outperforms task-specific baselines across multiple ranking metrics (average +12.7% NDCG@10). Human review efficiency improves by 3.2×, substantially reducing annotation costs. The framework offers a scalable, plug-and-play solution for audio data governance.

Technology Category

Application Category

📝 Abstract

Data quality issues such as off-topic samples, near duplicates, and label errors often limit the performance of audio-based systems. This paper addresses these issues by adapting SelfClean, a representation-to-rank data auditing framework, from the image to the audio domain. This approach leverages self-supervised audio representations to identify common data quality issues, creating ranked review lists that surface distinct issues within a single, unified process. The method is benchmarked on the ESC-50, GTZAN, and a proprietary industrial dataset, using both synthetic and naturally occurring corruptions. The results demonstrate that this framework achieves state-of-the-art ranking performance, often outperforming issue-specific baselines and enabling significant annotation savings by efficiently guiding human review.

Problem

Research questions and friction points this paper is trying to address.

Detecting off-topic samples and near duplicates

Identifying label errors in audio datasets

Unifying data quality audits using audio representations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapts SelfClean framework to audio domain

Uses self-supervised representations for data auditing

Creates unified ranking system for multiple issues

🔎 Similar Papers

No similar papers found.

💼 Related Jobs

Staff Data Engineer (Audio/ML)

Disney

The hiring range for this position in Nicasio, CA is $170,500 to $228,600 per year. The base pay actually offered will take into account internal equity and also may vary depending on the candidate’s geographic region, job-related knowledge, skills, and experience among other factors. A bonus and/or long-term incentive units may be provided as part of the compensation package, in addition to the full range of medical, financial, and/or other benefits, dependent on the level and position offered.

Nicasio, CA, USA

Authors to Follow