Voice Adaptation for Swiss German

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses text-to-speech synthesis and voice cloning for low-resource Swiss German dialects, taking Standard German text as input and producing dialectal speech. The authors preprocess a large collection of Swiss podcasts, automatically transcribed and annotated with dialect classes, to obtain approximately 5,000 hours of weakly labeled audio, and fine-tune the XTTSv2 model on it. The model is assessed with both human ratings (CMOS and SMOS) and automated metrics. It reaches CMOS scores of up to −0.28 and an SMOS of 3.8, and it can render the requested dialect correctly. The work is a step towards adapting voice cloning technology to underrepresented languages and dialects.

📝 Abstract
This work investigates the performance of Voice Adaptation models for Swiss German dialects, i.e., translating Standard German text to Swiss German dialect speech. For this, we preprocess a large dataset of Swiss podcasts, which we automatically transcribe and annotate with dialect classes, yielding approximately 5000 hours of weakly labeled training material. We fine-tune the XTTSv2 model on this dataset and show that it achieves good scores in human and automated evaluations and can correctly render the desired dialect. Our work shows a step towards adapting Voice Cloning technology to underrepresented languages. The resulting model achieves CMOS scores of up to -0.28 and SMOS scores of 3.8.
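The preprocessing step described above (automatic transcription plus dialect annotation, yielding weakly labeled audio) could be organized roughly as in the following sketch. It is illustrative only and is not the authors' pipeline: the `Segment` fields, the confidence and duration thresholds, and the dialect label strings are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Segment:
    """One weakly labeled audio segment (hypothetical manifest entry)."""
    audio_path: str
    transcript: str    # automatic transcript of the segment
    dialect: str       # dialect class predicted by a classifier
    confidence: float  # classifier confidence in [0, 1] (assumed field)
    duration_s: float  # segment length in seconds


def filter_segments(segments, min_confidence=0.8, min_s=1.0, max_s=20.0):
    """Keep segments with a non-empty transcript, a confident dialect label,
    and a duration inside the given bounds. All thresholds are assumptions."""
    return [
        s for s in segments
        if s.transcript.strip()
        and s.confidence >= min_confidence
        and min_s <= s.duration_s <= max_s
    ]


def hours_per_dialect(segments):
    """Sum the audio duration per dialect class, in hours."""
    totals = defaultdict(float)
    for s in segments:
        totals[s.dialect] += s.duration_s / 3600.0
    return dict(totals)


# Placeholder data; the labels "ZH" and "BE" are illustrative only.
example = [
    Segment("a.wav", "Guten Morgen", "ZH", 0.95, 6.0),
    Segment("b.wav", "Wie geht es dir", "BE", 0.55, 4.0),  # low confidence
    Segment("c.wav", "", "ZH", 0.99, 3.0),                 # empty transcript
    Segment("d.wav", "Bis morgen", "ZH", 0.90, 12.0),
]
kept = filter_segments(example)
totals = hours_per_dialect(kept)
```

Reporting hours per dialect after filtering makes it easy to see whether a dialect is too thinly represented for fine-tuning.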
Problem

Research questions and friction points this paper is trying to address.

Adapting voice cloning for Swiss German dialects
Translating Standard German text to Swiss German speech
Improving voice models for underrepresented languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Preprocess Swiss podcast dataset automatically
Fine-tune XTTSv2 model for dialect adaptation
Reach CMOS scores of up to -0.28 and an SMOS of 3.8 in human evaluation
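To make the two reported metrics concrete, here is a minimal sketch of how listener ratings are commonly aggregated. It assumes the conventional scales: CMOS ratings from -3 to +3 comparing a synthesized sample against a reference, so a negative mean means the system was rated below the reference, and SMOS ratings from 1 to 5 for speaker similarity. The paper's exact rating protocol is not given here, and the ratings below are made-up placeholders, not results from the study.

```python
import math
import statistics


def mean_with_ci(ratings, z=1.96):
    """Return (mean, half-width of an approximate 95% confidence interval).

    Uses a normal approximation; needs at least two ratings."""
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.fmean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(n)
    return mean, half_width


# Placeholder ratings for illustration only.
cmos_ratings = [-1, 0, 0, -1, 1, 0, -1, 0]  # comparative scale, -3..+3
smos_ratings = [4, 4, 3, 5, 4, 3, 4, 4]     # similarity scale, 1..5

cmos, cmos_ci = mean_with_ci(cmos_ratings)
smos, smos_ci = mean_with_ci(smos_ratings)
print(f"CMOS: {cmos:+.2f} ± {cmos_ci:.2f}")
print(f"SMOS: {smos:.2f} ± {smos_ci:.2f}")
```

Reporting the interval alongside the mean helps readers judge whether a small negative CMOS is distinguishable from parity with the reference.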