🤖 AI Summary
This study asks how language-specific the speech representations learned by self-supervised models are, systematically comparing Dutch-specific pre-training, English monolingual pre-training of comparable scale, and larger-scale multilingual pre-training. Using the Wav2Vec 2.0 architecture, the authors apply clustering probes, classification probes, and zero-shot evaluation to quantify how well internal representations encode Dutch phonetic and lexical information. The key finding is that Dutch-specific pre-training substantially improves the decodability of Dutch phonetic and lexical features, outperforming both English pre-training on similar amounts of data and multilingual pre-training on larger amounts; this advantage is clearly detected by trained probes, partially visible with zero-shot metrics, and aligns with downstream automatic speech recognition (ASR) performance. The results indicate that language-specific pre-training meaningfully strengthens phonetic and lexical representations for the target language, which is relevant guidance for speech modeling in lower-resource settings.
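As a rough illustration of the classification-probe setup described above, the sketch below fits a linear probe (logistic regression) on frame-level hidden states extracted from a Wav2Vec2 checkpoint. The checkpoint name, the probed layer, and the assumption that frame-aligned phoneme labels are available (e.g. from forced alignment) are illustrative choices, not details taken from the paper.

```python
# Minimal sketch of a linear classification probe on Wav2Vec2 hidden states.
# Assumes frame-level phoneme labels (e.g. from forced alignment) are available;
# the checkpoint and probed layer are placeholders, not the paper's exact setup.
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "facebook/wav2vec2-base"  # placeholder checkpoint
LAYER = 8                              # transformer layer to probe (assumption)

extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_NAME)
model = Wav2Vec2Model.from_pretrained(MODEL_NAME).eval()

def frame_features(waveform: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Return per-frame hidden states (frames x dim) from one transformer layer."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER].squeeze(0).numpy()

def fit_probe(utterances):
    """utterances: list of (waveform, frame_phoneme_labels) pairs, with labels
    aligned to the ~20 ms Wav2Vec2 frame rate (data loading not shown)."""
    X = np.concatenate([frame_features(w) for w, _ in utterances])
    y = np.concatenate([labels for _, labels in utterances])
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    return probe, probe.score(X, y)  # training accuracy as a rough diagnostic
```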
📝 Abstract
How language-specific are speech representations learned by self-supervised models? Existing work has shown that a range of linguistic features can be successfully decoded from end-to-end models trained only on speech recordings. However, it is less clear to what extent pre-training on specific languages improves the encoding of language-specific linguistic information. Here we test the encoding of Dutch phonetic and lexical information in internal representations of self-supervised Wav2Vec2 models. Pre-training exclusively on Dutch improves the representation of Dutch linguistic features as compared to pre-training on similar amounts of English or on larger amounts of multilingual data. This language-specific advantage is readily detected by trained clustering or classification probes, and is partially observable using zero-shot metrics. Furthermore, the language-specific benefit in linguistic feature encoding aligns with downstream performance on Automatic Speech Recognition.
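For the clustering probes mentioned in the abstract, one common recipe is to cluster frame-level features without supervision and then score how well the clusters align with phoneme labels. The sketch below uses k-means and V-measure as a stand-in; the cluster count and scoring metric are assumptions rather than the paper's exact configuration, and the feature/label arrays are assumed to come from a frame-extraction step like the one sketched earlier.

```python
# Minimal sketch of a clustering probe: k-means over per-frame features, scored
# by how well the induced clusters agree with phoneme labels (V-measure in [0, 1]).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

def clustering_probe(features: np.ndarray, phoneme_labels: np.ndarray,
                     n_clusters: int = 50, seed: int = 0) -> float:
    """Cluster frame-level features and return cluster-phoneme agreement."""
    clusters = KMeans(n_clusters=n_clusters, random_state=seed,
                      n_init=10).fit_predict(features)
    return v_measure_score(phoneme_labels, clusters)
```

Comparing this score across Dutch-, English-, and multilingually pre-trained checkpoints (and across layers) is one simple way to quantify the kind of language-specific advantage the paper reports.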