🤖 AI Summary
Existing Swiss German speech corpora consist predominantly of controlled read speech and lack large-scale spontaneous conversational resources, which limits progress in ASR, TTS, and dialect identification. To address this, the authors introduce SwissGPC v1.0, the first mid-to-large-scale corpus of spontaneous Swiss German speech, covering the seven major dialect regions plus Standard German, with nearly 5,000 hours of audio drawn from talk shows and podcasts. They propose an automated, weakly supervised annotation pipeline for speaker diarization, audio segmentation, and coarse-grained transcription. The corpus comes with statistics on dialect distribution, token counts, and segmentation characteristics, alongside an open-source benchmark evaluation protocol. By shifting from controlled to naturally occurring speech, SwissGPC addresses a longstanding data bottleneck and is intended to improve the modeling capability and real-world generalizability of dialectal speech technologies.
📝 Abstract
We present SwissGPC v1.0, the first mid-to-large-scale corpus of spontaneous Swiss German speech, developed to support research in ASR, TTS, dialect identification, and related fields. The dataset consists of links to talk shows and podcasts hosted on Schweizer Radio und Fernsehen and YouTube, which together contain approximately 5,400 hours of raw audio. After segmentation and weak annotation, nearly 5,000 hours of speech were retained, covering the seven major Swiss German dialect regions alongside Standard German. We describe the corpus construction methodology, including an automated annotation pipeline, and provide statistics on dialect distribution, token counts, and segmentation characteristics. Unlike existing Swiss German speech corpora, which primarily feature controlled speech, this corpus captures natural, spontaneous conversations, making it a valuable resource for real-world speech applications.