🤖 AI Summary
This work addresses extremely low-resource automatic speech recognition (ASR) for Faetar, an endangered Franco-Provençal variety with no standard orthography and virtually no existing text or speech resources. It introduces the first ASR benchmark for Faetar, comprising 5 hours of transcribed audio and 20 hours of unlabeled field recordings. To work around the scarcity of labeled data, the baseline pipeline uses *unsupervised continued pre-training*: a multilingual self-supervised speech model (e.g., XLS-R) is adapted to the target variety using the unlabeled speech, followed by phone-level modeling, forced alignment-based post-processing, and fine-grained evaluation. The best baseline reaches a phone error rate of 30.4%, which suggests that continued self-supervised pre-training on unlabeled target-language speech is a workable strategy when very little transcribed data is available. The pipeline offers a reusable methodology for speech technology in support of endangered languages.
📝 Abstract
We introduce the Faetar Automatic Speech Recognition Benchmark, a benchmark corpus designed to push the limits of current approaches to low-resource speech recognition. Faetar, a Franco-Provençal variety spoken primarily in Italy, has no standard orthography, has virtually no existing textual or speech resources other than what is included in the benchmark, and is quite different from other forms of Franco-Provençal. The corpus comes from field recordings, most of which are noisy. Only 5 hrs have matching transcriptions, and the forced alignment is of variable quality. The corpus contains an additional 20 hrs of unlabelled speech. We report baseline results from state-of-the-art multilingual speech foundation models with a best phone error rate of 30.4%, using a pipeline that continues pre-training the foundation model on the unlabelled set.
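For readers unfamiliar with the metric, phone error rate is conventionally the Levenshtein (edit) distance between the predicted and reference phone sequences, divided by the number of reference phones. The sketch below is a minimal illustration of that conventional definition, not the benchmark's own scoring script. The function names, the corpus-level aggregation (summing errors over all utterances before dividing), and the toy phone sequences are assumptions made for this example.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phone sequences (unit costs)."""
    # prev[j] holds the distance between the first i-1 reference phones
    # and the first j hypothesis phones.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1       # reference phone missing from hypothesis
            insertion = curr[j - 1] + 1  # extra phone in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def phone_error_rate(
    refs: Sequence[Sequence[str]], hyps: Sequence[Sequence[str]]
) -> float:
    """Corpus-level PER: total edit errors divided by total reference phones.

    Assumption: errors are pooled over utterances rather than averaged
    per utterance. The benchmark's exact aggregation is not stated here.
    """
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same number of utterances")
    total_phones = sum(len(r) for r in refs)
    if total_phones == 0:
        raise ValueError("reference set contains no phones")
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_errors / total_phones


# Toy sequences for illustration only (not Faetar data).
refs = [["k", "a", "t"], ["d", "o", "g", "z"]]
hyps = [["k", "a", "p"], ["d", "o", "g"]]
per = phone_error_rate(refs, hyps)
print(f"{per:.1%}")  # one substitution + one deletion over 7 phones: 28.6%
```

Under this definition, a PER of 30.4% corresponds to roughly three edit errors for every ten reference phones. Because insertions count as errors, the rate can exceed 100% on very poor output.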