Which Data Matter? Embedding-Based Data Selection for Speech Recognition

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges that domain-specific automatic speech recognition (ASR) models face in leveraging large-scale heterogeneous data and in mitigating train-test condition mismatch. To this end, the authors propose a multidimensional embedding-based data selection strategy that evaluates 100,000 hours of in-the-wild speech data along three dimensions—speaker attributes, phonetic content, and semantic information—to assess both relevance and diversity. The top 5% of the data selected by this criterion is used to train a CTC-based Conformer ASR model. Experiments show that, on the target domain, the model trained exclusively on this curated subset achieves up to a 36.8% relative reduction in word error rate (WER) compared to a model trained on the full dataset, substantially improving data efficiency and validating the proposed multidimensional embedding approach.

📝 Abstract
Modern ASR systems are typically trained on large-scale pseudo-labeled, in-the-wild data spanning multiple domains. While such heterogeneous data benefit generalist models designed for broad deployment, they pose challenges for specialist models targeting specific domains: specialist models lack the capacity to learn from all available data, and one must pay closer attention to addressing the mismatch between training and test conditions. In this work, we study targeted data selection as a strategy to address these challenges, selecting relevant subsets from 100k hours of in-the-wild training data to optimize performance on target domains. We represent speech samples using embeddings that capture complementary characteristics—speaker attributes, phonetic content, and semantic meaning—and analyze how relevance and diversity along these axes affect downstream ASR performance when used for data selection. Our experiments with CTC-based Conformer models show that training on a strategically selected 5% subset can exceed the performance of models trained on the full dataset by up to 36.8% relative WER reduction on target domains.
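The abstract's core idea, scoring candidate samples by relevance to a target domain while preserving diversity in embedding space, can be sketched as below. The cosine-similarity relevance score, the greedy farthest-point diversity term, and the weighting between them are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def select_subset(pool_emb, target_emb, k, relevance_weight=0.7):
    """Greedily pick k samples balancing relevance to the target-domain
    centroid against diversity within the already-selected set.

    pool_emb:   (N, D) unit-normalized embeddings of candidate samples
    target_emb: (M, D) unit-normalized embeddings of target-domain samples
    """
    centroid = target_emb.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    relevance = pool_emb @ centroid            # cosine similarity to target domain

    selected = [int(np.argmax(relevance))]     # seed with the most relevant sample
    # cosine distance from each candidate to its nearest selected sample
    min_dist = 1.0 - pool_emb @ pool_emb[selected[0]]
    for _ in range(k - 1):
        score = relevance_weight * relevance + (1 - relevance_weight) * min_dist
        score[selected] = -np.inf              # never re-pick a sample
        nxt = int(np.argmax(score))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, 1.0 - pool_emb @ pool_emb[nxt])
    return selected
```

In practice the paper uses three embedding views (speaker, phonetic, semantic); a simple way to combine them under this sketch would be to concatenate the per-view embeddings or average per-view scores before selection.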
Problem

Research questions and friction points this paper is trying to address.

data selection
speech recognition
domain adaptation
ASR
training data mismatch
Innovation

Methods, ideas, or system contributions that make the work stand out.

embedding-based data selection
domain-specific ASR
speech embeddings
data relevance and diversity
Conformer CTC