🤖 AI Summary
This study addresses the scarcity of dialectal speech resources, which limits the performance of speech technologies on non-standard language varieties. It presents Saar-Voice, the first multi-speaker speech corpus for the Saarbrücken dialect of German, comprising six hours of aligned audio–text data recorded by nine native speakers. The work covers systematic collection of dialectal texts, manual recording, forced alignment, grapheme-to-phoneme (G2P) modeling, and analysis of pronunciation variation, and it examines the challenges posed by orthographic–phonetic mismatches in low-resource dialects. The resulting corpus provides a data foundation for zero-shot and few-shot text-to-speech (TTS) research on dialects and highlights core difficulties and viable approaches in low-resource dialect modeling.
📝 Abstract
Natural language processing (NLP) and speech technologies have made significant progress in recent years; however, they remain largely focused on standardized language varieties. Dialects, despite their cultural significance and widespread use, are underrepresented in linguistic resources and computational models, resulting in performance disparities. To address this gap, we introduce Saar-Voice, a six-hour speech corpus for the Saarbrücken dialect of German. The dataset was created by first collecting text from digitized books and locally sourced materials. A subset of this text was then recorded by nine speakers, and we analyzed both the textual and speech components to assess the dataset's characteristics and quality. We discuss methodological challenges related to orthographic and speaker variation, and explore grapheme-to-phoneme (G2P) conversion. The resulting corpus provides aligned textual and audio representations and serves as a foundation for future research on dialect-aware text-to-speech (TTS), particularly in low-resource scenarios, including zero-shot and few-shot model adaptation.