SwissGPC v1.0 -- The Swiss German Podcasts Corpus

📅 2025-09-24

🤖 AI Summary

Existing Swiss German speech corpora consist predominantly of controlled speech and lack large-scale spontaneous conversational data, which limits progress in ASR, TTS, and dialect identification. To address this, the authors introduce SwissGPC v1.0, the first mid-to-large-scale corpus of spontaneous Swiss German speech. It covers the seven major Swiss German dialect regions plus Standard German, with nearly 5,000 hours of speech retained from approximately 5,400 hours of raw audio drawn from talk shows and podcasts hosted on Schweizer Radio und Fernsehen and YouTube. The corpus is built with an automated pipeline that segments the audio and adds weak annotations. The paper reports statistics on dialect distribution, token counts, and segmentation characteristics. By capturing natural, spontaneous conversations rather than controlled speech, SwissGPC v1.0 is positioned as a resource for real-world speech applications.

📝 Abstract

We present SwissGPC v1.0, the first mid-to-large-scale corpus of spontaneous Swiss German speech, developed to support research in ASR, TTS, dialect identification, and related fields. The dataset consists of links to talk shows and podcasts hosted on Schweizer Radio und Fernsehen and YouTube, which contain approximately 5400 hours of raw audio. After segmentation and weak annotation, nearly 5000 hours of speech were retained, covering the seven major Swiss German dialect regions alongside Standard German. We describe the corpus construction methodology, including an automated annotation pipeline, and provide statistics on dialect distribution, token counts, and segmentation characteristics. Unlike existing Swiss German speech corpora, which primarily feature controlled speech, this corpus captures natural, spontaneous conversations, making it a valuable resource for real-world speech applications.
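The statistics mentioned in the abstract (dialect distribution, token counts, segmentation characteristics) could be computed along the following lines. This is only a minimal sketch under stated assumptions: the record fields `dialect`, `duration_s`, and `text`, the placeholder labels, and both function names are hypothetical and are not taken from the SwissGPC release; the hour figures passed to `retention` are the rounded values quoted in the abstract.

```python
from collections import defaultdict

# Hypothetical segment records. Field names, labels, and values are
# illustrative placeholders, not data from the corpus.
segments = [
    {"dialect": "region_a", "duration_s": 12.5, "text": "tok tok tok tok"},
    {"dialect": "region_b", "duration_s": 8.0, "text": "tok tok tok"},
    {"dialect": "region_a", "duration_s": 7.5, "text": "tok tok tok"},
]


def summarize(records):
    """Aggregate hours, whitespace-token counts, and segment counts per dialect label."""
    stats = defaultdict(lambda: {"hours": 0.0, "tokens": 0, "segments": 0})
    for rec in records:
        entry = stats[rec["dialect"]]
        entry["hours"] += rec["duration_s"] / 3600.0
        entry["tokens"] += len(rec["text"].split())  # naive whitespace tokenization
        entry["segments"] += 1
    return dict(stats)


def retention(raw_hours, kept_hours):
    """Fraction of raw audio that survives segmentation and weak annotation."""
    return kept_hours / raw_hours


if __name__ == "__main__":
    for label, entry in sorted(summarize(segments).items()):
        print(label, entry)
    # Rounded figures from the abstract: about 5400 raw hours, nearly 5000 retained.
    print(f"retained share: {retention(5400, 5000):.1%}")
```

With the rounded figures from the abstract, the retained share comes out at a little over nine tenths of the raw audio; the exact value depends on the unrounded hour counts, which the abstract does not give.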

Problem

Research questions and friction points this paper is trying to address.

Creating the first mid-to-large-scale corpus of spontaneous Swiss German speech

Supporting research in ASR, TTS, and dialect identification

Capturing natural conversations across seven major dialect regions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated annotation pipeline for corpus construction

Natural spontaneous conversations from podcasts

Coverage of the seven major Swiss German dialect regions alongside Standard German
