🤖 AI Summary
Existing accent normalization systems often suffer from unnatural outputs or content distortion due to scarce training data and rigid duration modeling. To address these limitations, this work proposes a "source-synthesis" training strategy that pairs synthetic non-native speech with real native speech, enabling effective training without any authentic L2 data. The authors also introduce CosyAccent, a non-autoregressive model that combines implicit prosody modeling with explicit control over total output duration. This approach mitigates TTS artifacts and, despite using no genuine L2 utterances, outperforms strong baselines trained on real non-native speech, achieving significant improvements in both content fidelity and speech naturalness.
📝 Abstract
Accent normalization (AN) systems often struggle with unnatural outputs and undesired content distortion, stemming from both suboptimal training data and rigid duration modeling. In this paper, we propose a "source-synthesis" methodology for training data construction. By generating source L2 speech and using authentic native speech as the training target, our approach avoids learning from TTS artifacts and, crucially, requires no real L2 data in training. Alongside this data strategy, we introduce CosyAccent, a non-autoregressive model that resolves the trade-off between prosodic naturalness and duration control. CosyAccent implicitly models rhythm for flexibility yet offers explicit control over total output duration. Experiments show that, despite being trained without any real L2 speech, CosyAccent achieves significantly improved content preservation and superior naturalness compared to strong baselines trained on real-world data.
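The "source-synthesis" data construction described above can be sketched in a few lines: for each real native utterance, an accented TTS system synthesizes an L2-style version of the same transcript, and the pair (synthetic L2 source, real native target) becomes a training example. The sketch below is a minimal illustration under assumptions, not the paper's actual pipeline; `synthesize_l2_speech` is a hypothetical stand-in for whatever accented TTS system is used.

```python
# Minimal sketch of "source-synthesis" pair construction: synthetic L2 speech
# as the model input, authentic native speech as the training target, so no
# real L2 recordings are needed and the target side carries no TTS artifacts.

def synthesize_l2_speech(text: str, accent: str) -> str:
    """Hypothetical stand-in for an accented TTS system. Returns a string
    tag instead of an actual waveform, purely for illustration."""
    return f"tts({accent}):{text}"

def build_training_pairs(native_corpus, accent="L2"):
    """Pair synthetic accented source speech with real native target speech.

    native_corpus: iterable of (transcript, native_audio) tuples, where
    native_audio is a real recording of a native speaker.
    """
    pairs = []
    for transcript, native_audio in native_corpus:
        source = synthesize_l2_speech(transcript, accent)  # synthetic L2 input
        target = native_audio                              # real native output
        pairs.append((source, target))
    return pairs

corpus = [("hello world", "native_wav_001"), ("good morning", "native_wav_002")]
pairs = build_training_pairs(corpus)
print(pairs[0])  # ('tts(L2):hello world', 'native_wav_001')
```

The key design point, per the abstract, is the direction of synthesis: because the synthetic speech sits on the *source* side only, the model never learns to reproduce TTS artifacts in its output.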