Voice Adaptation for Swiss German

📅 2025-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses text-to-speech synthesis and voice cloning for low-resource Swiss German dialects, taking Standard German text as input and producing Swiss German dialect speech. We preprocess a large corpus of Swiss podcasts, transcribe it automatically, and annotate it with dialect classes, yielding approximately 5,000 hours of weakly labeled training data. We fine-tune the XTTSv2 model on this data and evaluate it with human ratings (CMOS and SMOS) and automated metrics. The resulting model reaches CMOS scores of up to −0.28 and an SMOS of 3.8, and it correctly renders the desired dialect. This work is a step towards adapting voice cloning technology to underrepresented languages and dialects.

📝 Abstract

This work investigates the performance of voice adaptation models for Swiss German dialects, i.e., translating Standard German text to Swiss German dialect speech. For this, we preprocess a large dataset of Swiss podcasts, which we automatically transcribe and annotate with dialect classes, yielding approximately 5000 hours of weakly labeled training material. We fine-tune the XTTSv2 model on this dataset and show that it achieves good scores in human and automated evaluations and can correctly render the desired dialect. The resulting model achieves CMOS scores of up to -0.28 and SMOS scores of 3.8. Our work is a step towards adapting voice cloning technology to underrepresented languages.
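The CMOS and SMOS figures quoted above are averages over listener ratings. The following is a minimal sketch of how such scores are commonly aggregated, not the paper's evaluation code. The rating scales (−3 to +3 for CMOS, 1 to 5 for SMOS), the sample ratings, and the helper `mean_with_ci` are assumptions made for illustration; the abstract does not specify the protocol.

```python
from math import sqrt
from statistics import mean, stdev


def mean_with_ci(scores, z=1.96):
    """Return (mean, half-width of an approximate 95% confidence interval)."""
    m = mean(scores)
    if len(scores) < 2:
        return m, 0.0
    return m, z * stdev(scores) / sqrt(len(scores))


# Hypothetical CMOS ratings: each value is one listener's preference for the
# synthesized sample relative to a reference (assumed scale -3..+3, where a
# negative value means the reference was preferred).
cmos_ratings = [-1, 0, 0, -1, 1, 0, -1, 0]

# Hypothetical SMOS ratings: speaker similarity on an assumed 1..5 scale.
smos_ratings = [4, 4, 3, 5, 4, 3, 4, 4]

cmos, cmos_ci = mean_with_ci(cmos_ratings)
smos, smos_ci = mean_with_ci(smos_ratings)

print(f"CMOS = {cmos:+.2f} ± {cmos_ci:.2f}")
print(f"SMOS = {smos:.2f} ± {smos_ci:.2f}")
```

Under these assumed scales, a CMOS slightly below zero means listeners rated the system's output slightly below the reference on average, and an SMOS near 4 indicates that the cloned voice was judged fairly similar to the target speaker.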

Problem

Research questions and friction points this paper is trying to address.

Adapting voice cloning for Swiss German dialects

Translating Standard German text to Swiss German speech

Improving voice models for underrepresented languages

Innovation

Methods, ideas, or system contributions that make the work stand out.

Preprocess Swiss podcast dataset automatically

Fine-tune XTTSv2 model for dialect adaptation

Reach CMOS scores of up to -0.28 and SMOS scores of 3.8 in human evaluation
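The preprocessing step listed above can be pictured as filtering automatically labeled podcast segments before fine-tuning. The sketch below is an assumption about what such a filter might look like, not the authors' pipeline: the `Segment` fields, the confidence and duration thresholds, the file names, and the dialect labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    audio_path: str
    transcript: str      # automatic transcript, may contain errors
    dialect: str         # dialect class predicted by a classifier
    dialect_conf: float  # classifier confidence in [0, 1]
    duration_s: float    # segment length in seconds


def select_training_segments(segments, min_conf=0.8, min_s=1.0, max_s=20.0):
    """Keep segments with a non-empty transcript, a confident dialect label,
    and a duration inside [min_s, max_s]. All thresholds are illustrative."""
    return [
        s for s in segments
        if s.transcript.strip()
        and s.dialect_conf >= min_conf
        and min_s <= s.duration_s <= max_s
    ]


def hours_per_dialect(segments):
    """Sum the audio duration per dialect class, in hours."""
    totals = {}
    for s in segments:
        totals[s.dialect] = totals.get(s.dialect, 0.0) + s.duration_s / 3600
    return totals


# Hypothetical weakly labeled segments.
segments = [
    Segment("ep1_0001.wav", "Guten Morgen zusammen.", "Bern", 0.93, 9.0),
    Segment("ep1_0002.wav", "", "Bern", 0.97, 4.0),                # empty transcript
    Segment("ep2_0001.wav", "Das Wetter ist schön.", "Zürich", 0.55, 6.0),  # low confidence
    Segment("ep2_0002.wav", "Wir sprechen über Politik.", "Zürich", 0.88, 12.0),
    Segment("ep3_0001.wav", "Ja.", "Bern", 0.91, 0.4),             # too short
    Segment("ep3_0002.wav", "Heute geht es um Musik.", "Bern", 0.90, 9.0),
]

kept = select_training_segments(segments)
totals = hours_per_dialect(kept)
print(len(kept), totals)
```

Because both the transcripts and the dialect labels are produced automatically, a filter of this kind trades corpus size for label quality; where to set that trade-off is a design choice this sketch does not settle.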
