Contrastive Regularization for Accent-Robust ASR

📅 2026-05-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the performance degradation of automatic speech recognition (ASR) systems on unseen or non-native accents. To improve robustness, the authors propose a lightweight, model-agnostic regularization method that requires no accent labels: during CTC fine-tuning, an utterance-level supervised contrastive (SupCon) loss is added to shape the geometry of the encoder representations. The approach needs only a self-supervised pre-trained acoustic model and a standard CTC framework, with no architectural modifications or explicit accent annotations. On the L2-ARCTIC benchmark, the method achieves up to a 29% relative reduction in word error rate under unseen-accent evaluation, and the analysis indicates that the learned representations are more compact and stable under accent variation.
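
As a rough illustration of the utterance-level contrastive term described above, the following NumPy sketch computes a SupCon loss over pooled encoder outputs. Several details are assumptions rather than the authors' implementation: frame-level encoder outputs are mean-pooled into one vector per utterance, positives are utterances that share a label (for example a transcript ID), and the names `mean_pool`, `supcon_loss`, `temperature`, and `supcon_weight` are hypothetical.

```python
import numpy as np

def mean_pool(frames, lengths):
    """Average valid frames into one utterance-level vector (pooling choice is an assumption).

    frames:  (N, T, D) padded encoder outputs
    lengths: (N,) number of valid frames per utterance
    """
    frames = np.asarray(frames, dtype=np.float64)
    lengths = np.asarray(lengths)
    mask = np.arange(frames.shape[1])[None, :] < lengths[:, None]  # (N, T)
    summed = (frames * mask[:, :, None]).sum(axis=1)               # ignore padded frames
    return summed / lengths[:, None]

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over a batch of utterance embeddings.

    Utterances with equal labels are treated as positives for each other.
    Anchors without any positive in the batch are skipped.
    """
    z = np.asarray(embeddings, dtype=np.float64)
    labels = np.asarray(labels)
    n = z.shape[0]
    self_mask = np.eye(n, dtype=bool)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos_mask.sum(axis=1)
    valid = pos_count > 0
    if not valid.any():
        return 0.0

    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity space
    sim = (z @ z.T) / temperature
    sim = np.where(self_mask, -np.inf, sim)            # exclude each anchor from its own denominator
    row_max = sim.max(axis=1, keepdims=True)           # stabilise the log-sum-exp
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))

    sum_pos = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    per_anchor = -sum_pos[valid] / pos_count[valid]    # mean log-likelihood of positives
    return float(per_anchor.mean())

# Hypothetical combination with the CTC objective during fine-tuning:
#   total_loss = ctc_loss + supcon_weight * supcon_loss(mean_pool(frames, lengths), labels)
```

In this sketch the loss is small when utterances with the same label cluster together and are separated from the rest of the batch, which is the kind of compactness the regularizer is meant to encourage.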

📝 Abstract

ASR systems based on self-supervised acoustic pretraining and CTC fine-tuning achieve strong performance on native speech but remain sensitive to accent variability. We investigate supervised contrastive learning (SupCon) as a lightweight, accent-invariant auxiliary objective for CTC fine-tuning. An utterance-level contrastive loss regularizes encoder representations without architectural modification or explicit accent supervision. Experiments on the L2-ARCTIC benchmark show consistent WER reductions across multiple pretrained encoders, with up to 25-29% relative reduction under unseen-accent evaluation. Analysis using within-transcript cosine dispersion indicates that SupCon promotes more compact and stable representation geometry under accent variability. Overall, SupCon provides an effective and model-agnostic regularization strategy for improving accent robustness.
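
The abstract does not define within-transcript cosine dispersion, so the sketch below shows one plausible reading, offered as an assumption: for each transcript, take the mean pairwise cosine distance between the embeddings of all utterances of that transcript, then average over transcripts. The function name and the grouping by `transcript_ids` are hypothetical.

```python
import numpy as np

def within_transcript_cosine_dispersion(embeddings, transcript_ids):
    """Mean pairwise cosine distance among utterances sharing a transcript.

    Assumed definition: 1 - average off-diagonal cosine similarity per transcript,
    averaged over transcripts with at least two utterances.
    """
    z = np.asarray(embeddings, dtype=np.float64)
    ids = np.asarray(transcript_ids)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)

    per_transcript = []
    for t in np.unique(ids):
        group = z[ids == t]
        m = group.shape[0]
        if m < 2:
            continue  # dispersion is undefined for a single utterance
        sim = group @ group.T
        mean_sim = (sim.sum() - np.trace(sim)) / (m * (m - 1))  # off-diagonal mean
        per_transcript.append(1.0 - mean_sim)

    return float(np.mean(per_transcript)) if per_transcript else 0.0
```

Under this reading, a lower value means that different speakers reading the same sentence land closer together in embedding space, which would correspond to the "more compact" geometry the analysis reports.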

Problem

Research questions and friction points this paper is trying to address.

accent robustness

automatic speech recognition

accent variability

self-supervised pretraining

CTC fine-tuning

Innovation

Methods, ideas, or system contributions that make the work stand out.

supervised contrastive learning

accent robustness

CTC fine-tuning