🤖 AI Summary
German multilectal ASR research is hindered by scarce dialectal speech data and the absence of robust, standardized evaluation benchmarks. To address this, we introduce Betthupferl, the first publicly available speech-to-dual-transcription dataset covering three Southeast German dialect groups (Franconian, Bavarian, Alemannic) alongside Standard German, enabling both dialect identification and end-to-end dialect-to-Standard German speech translation. We propose a linguistically grounded, controllable normalization evaluation protocol that quantifies dialectal retention versus grammatical standardization. We systematically benchmark state-of-the-art multilingual models, including Whisper and SeamlessM4T. Results reveal inconsistent grammatical normalization: some outputs approximate Standard German, but the models often retain dialect-specific syntactic and morphological structures. This work delivers a reproducible multilectal ASR and speech translation benchmark, diagnostic error analysis tools, and a linguistics-informed evaluation framework for dialectal language processing.
📝 Abstract
Although Germany has a diverse landscape of dialects, they are underrepresented in current automatic speech recognition (ASR) research. To enable studies of how robust models are to dialectal variation, we present Betthupferl, an evaluation dataset containing four hours of read speech in three dialect groups spoken in Southeast Germany (Franconian, Bavarian, Alemannic), and half an hour of Standard German speech. We provide both dialectal and Standard German transcriptions, and analyze the linguistic differences between them. We benchmark several state-of-the-art multilingual ASR models on speech translation into Standard German, and find that the models differ in how closely their output resembles the dialectal versus the standardized transcriptions. Qualitative error analyses of the best ASR model reveal that it sometimes normalizes grammatical differences, but often stays closer to the dialectal constructions.