The Faetar Benchmark: Speech Recognition in a Very Under-Resourced Language

📅 2024-09-12

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses automatic speech recognition (ASR) for Faetar, a Franco-Provençal variety spoken primarily in Italy that has no standard orthography and virtually no textual or speech resources beyond this benchmark. The authors introduce the first ASR benchmark for Faetar, comprising 5 hours of transcribed field recordings, most of them noisy and with forced alignments of variable quality, plus 20 hours of unlabelled speech. The setting is very low-resource rather than zero-resource, since a small labelled set exists. The baseline pipeline uses *continued pre-training*: a multilingual self-supervised speech foundation model (e.g., XLS-R) is further pre-trained on the unlabelled Faetar speech before being trained on the transcribed portion. The best baseline reaches a phone error rate of 30.4%, which leaves substantial room for improvement and makes the corpus a demanding test of current low-resource ASR methods.

📝 Abstract

We introduce the Faetar Automatic Speech Recognition Benchmark, a benchmark corpus designed to push the limits of current approaches to low-resource speech recognition. Faetar, a Franco-Provençal variety spoken primarily in Italy, has no standard orthography, has virtually no existing textual or speech resources other than what is included in the benchmark, and is quite different from other forms of Franco-Provençal. The corpus comes from field recordings, most of which are noisy, for which only 5 hrs have matching transcriptions, and for which forced alignment is of variable quality. The corpus contains an additional 20 hrs of unlabelled speech. We report baseline results from state-of-the-art multilingual speech foundation models with a best phone error rate of 30.4%, using a pipeline that continues pre-training on the foundation model using the unlabelled set.
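
The headline metric in the abstract is phone error rate (PER). As a rough illustration of how such a figure is typically computed, the sketch below takes the Levenshtein (edit) distance between reference and hypothesis phone sequences and divides the total by the number of reference phones. This is an assumption about the usual definition, not the benchmark's own scoring script; the function names `edit_distance` and `phone_error_rate` and the toy phone sequences are hypothetical and not taken from the corpus.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phone sequences (unit costs)."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j phones of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1       # reference phone missing from hypothesis
            insertion = curr[j - 1] + 1  # extra phone in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def phone_error_rate(
    refs: Sequence[Sequence[str]], hyps: Sequence[Sequence[str]]
) -> float:
    """Corpus-level PER: total edit errors divided by total reference phones."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must contain the same number of utterances")
    total_phones = sum(len(r) for r in refs)
    if total_phones == 0:
        raise ValueError("reference transcripts contain no phones")
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_errors / total_phones


# Toy example with made-up phone strings (not real Faetar data):
# one phone is dropped from a six-phone reference.
refs = [["f", "a", "e", "t", "a", "r"]]
hyps = [["f", "a", "t", "a", "r"]]
per = phone_error_rate(refs, hyps)
```

Under this definition, a PER of 30.4% means roughly three edit errors for every ten reference phones, aggregated over the evaluation set. Pooling errors over the corpus, as done here, weights long utterances more heavily than averaging per-utterance rates would.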

Problem

Research questions and friction points this paper is trying to address.

Dialect Language Recognition

Limited Learning Resources

Faetar Dialect

Innovation

Methods, ideas, or system contributions that make the work stand out.

Faetar

Limited-resource Languages

Multilingual Speech Recognition

🔎 Similar Papers

GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement