Effects of Speaker Count, Duration, and Accent Diversity on Zero-Shot Accent Robustness in Low-Resource ASR

📅 2025-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates how training data composition affects zero-shot accent robustness in low-resource automatic speech recognition (ASR). Through systematic, controlled experiments on a unified benchmark, it assesses three factors: the number of speakers, the audio duration per speaker, and accent diversity. For a fixed total training duration, increasing the speaker count improves generalization to unseen accents more than increasing the hours contributed per speaker, and a larger speaker pool is what allows additional training hours to translate into gains. When the speaker count is held constant, prioritizing accent diversity yields only minimal benefit, which challenges the common “accent-prioritized sampling” assumption. The main takeaway is that speaker count is the most important of the three factors for robustness to unseen accents, giving practitioners concrete guidance for collecting and allocating data when building ASR for new languages.

📝 Abstract

To build an automatic speech recognition (ASR) system that can serve everyone in the world, the system needs to be robust to a wide range of accents, including unseen accents. We systematically study how three variables in the training data -- the number of speakers, the audio duration per speaker, and the diversity of accents -- affect ASR robustness to unseen accents in a low-resource training regime. We observe that, for a fixed number of ASR training hours, it is more beneficial to increase the number of speakers (which means each speaker contributes less audio) than to increase the number of hours contributed per speaker. We also observe that having more speakers enables ASR performance gains from scaling the number of hours. Surprisingly, we observe minimal benefits from prioritizing speakers with different accents when the number of speakers is controlled. Our work suggests that practitioners should prioritize increasing the speaker count when composing ASR training data for new languages.
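
To make the fixed-budget comparison in the abstract concrete, here is a minimal sketch of how one might build training subsets that share the same total duration but differ in speaker count. It is an illustration under stated assumptions, not the authors' sampling procedure: the function name `sample_fixed_budget`, the `(speaker_id, utt_id, duration_seconds)` tuple layout, and the even split of the budget across speakers are all hypothetical choices made for this example.

```python
import random
from collections import defaultdict


def sample_fixed_budget(utterances, n_speakers, total_seconds, seed=0):
    """Select utterances from n_speakers under a shared duration budget.

    utterances: iterable of (speaker_id, utt_id, duration_seconds) tuples
                (a hypothetical manifest layout, not taken from the paper).
    The budget is split evenly, so more speakers means less audio each.
    """
    rng = random.Random(seed)

    # Group the manifest by speaker.
    by_speaker = defaultdict(list)
    for spk, utt, dur in utterances:
        by_speaker[spk].append((spk, utt, dur))

    per_speaker = total_seconds / n_speakers

    # Keep only speakers who have enough audio to fill their share.
    eligible = sorted(
        spk for spk, utts in by_speaker.items()
        if sum(u[2] for u in utts) >= per_speaker
    )
    if len(eligible) < n_speakers:
        raise ValueError("not enough speakers with sufficient audio")

    selected = []
    for spk in rng.sample(eligible, n_speakers):
        utts = list(by_speaker[spk])
        rng.shuffle(utts)
        used = 0.0
        for item in utts:
            # Stop once this speaker's share is filled; the last utterance
            # added may overshoot the share by less than its own duration.
            if used >= per_speaker:
                break
            selected.append(item)
            used += item[2]
    return selected
```

Calling this with the same `total_seconds` and different `n_speakers` values gives subsets that trade per-speaker duration for speaker count, which is the axis the abstract compares. Controlling for accent would need an extra accent label per speaker and is left out of this sketch.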

Problem

Research questions and friction points this paper is trying to address.

Study impact of speaker count on ASR accent robustness

Examine effect of training duration per speaker

Assess accent diversity's role in low-resource ASR

Innovation

Methods, ideas, or system contributions that make the work stand out.

Prioritize increasing speaker count in training

Balance speaker duration and accent diversity

Optimize low-resource ASR for unseen accents

🔎 Similar Papers

AccentBox: Towards High-Fidelity Zero-Shot Accent Generation