🤖 AI Summary
This study addresses end-to-end text-to-speech synthesis and personalized voice cloning for low-resource Swiss German dialects, taking Standard German text as input and producing dialectal speech as output. Using approximately 5,000 hours of Swiss podcast audio, automatically transcribed and weakly labeled with dialect classes, the authors fine-tune the XTTSv2 model to produce an end-to-end Swiss German adaptation covering multiple dialect variants. The methodology comprises large-scale dialectal audio preprocessing, dialect-aware fine-tuning, and human evaluation (CMOS/SMOS) combined with automated metrics. The resulting model reaches CMOS scores of up to −0.28 and SMOS scores of 3.8, which the authors report as good results, and it renders the requested dialect correctly. The work is presented as a step towards adapting voice cloning to under-resourced languages and dialects.
📝 Abstract
This work investigates the performance of voice adaptation models for Swiss German dialects, i.e., translating Standard German text to Swiss German dialect speech. To this end, we preprocess a large dataset of Swiss podcasts, which we automatically transcribe and annotate with dialect classes, yielding approximately 5,000 hours of weakly labeled training material. We fine-tune the XTTSv2 model on this dataset and show that it achieves good scores in human and automated evaluations and can correctly render the desired dialect. The resulting model achieves CMOS scores of up to -0.28 and SMOS scores of 3.8. Our work is a step towards adapting voice cloning technology to underrepresented languages.