The Faetar Benchmark: Speech Recognition in a Very Under-Resourced Language

📅 2024-09-12
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses very low-resource automatic speech recognition (ASR) for Faetar, a Franco-Provençal variety spoken primarily in Italy that has no standard orthography and virtually no existing text or speech resources. We introduce the first ASR benchmark for Faetar, comprising about 5 hours of transcribed audio and an additional 20 hours of unlabelled field recordings; most recordings are noisy, and forced alignment is of variable quality. Baselines are built on state-of-the-art multilingual speech foundation models. The best pipeline continues self-supervised pre-training of the foundation model on the unlabelled set before supervised training, and reaches a phone error rate of 30.4%. The benchmark is intended to push the limits of current approaches to low-resource speech recognition and to support speech technology for endangered language varieties.

📝 Abstract
We introduce the Faetar Automatic Speech Recognition Benchmark, a benchmark corpus designed to push the limits of current approaches to low-resource speech recognition. Faetar, a Franco-Provençal variety spoken primarily in Italy, has no standard orthography, has virtually no existing textual or speech resources other than what is included in the benchmark, and is quite different from other forms of Franco-Provençal. The corpus comes from field recordings, most of which are noisy, for which only 5 hrs have matching transcriptions, and for which forced alignment is of variable quality. The corpus contains an additional 20 hrs of unlabelled speech. We report baseline results from state-of-the-art multilingual speech foundation models with a best phone error rate of 30.4%, using a pipeline that continues pre-training on the foundation model using the unlabelled set.
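A phone error rate such as the 30.4% quoted above is conventionally the total edit distance (substitutions, deletions, insertions) between hypothesis and reference phone sequences, divided by the total number of reference phones. The sketch below illustrates that conventional definition only; the benchmark's own scoring script is not shown on this page, and the function names and toy phone symbols are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phone sequences."""
    # prev[j] holds the distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance between ref[:i] and an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference phone
                curr[j - 1] + 1,     # insertion of a hypothesis phone
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def phone_error_rate(
    refs: Sequence[Sequence[str]], hyps: Sequence[Sequence[str]]
) -> float:
    """Corpus-level PER: summed edit distance over summed reference length."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same number of utterances")
    total_phones = sum(len(r) for r in refs)
    if total_phones == 0:
        raise ValueError("reference set contains no phones")
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_errors / total_phones


# Toy example with made-up symbols (not Faetar data):
# one substitution (b -> x) and one deletion (d) over 4 reference phones.
refs = [["a", "b", "c", "d"]]
hyps = [["a", "x", "c"]]
per = phone_error_rate(refs, hyps)
print(f"PER = {per:.1%}")
```

Errors are pooled over the whole test set rather than averaged per utterance, so long utterances weigh more than short ones; this is the usual convention for corpus-level error rates, assumed here rather than confirmed for this benchmark.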
Problem

Research questions and friction points this paper is trying to address.

Dialect Language Recognition
Limited Learning Resources
Faetar Dialect
Innovation

Methods, ideas, or system contributions that make the work stand out.

Faetar
Limited-resource Languages
Multilingual Speech Recognition
Michael Ong
Dept. of Linguistics, University of Toronto, Toronto, Canada
Sean Robertson
Depts of French and Computer Science, University of Toronto, Toronto, Canada
Leo Peckham
Depts of Linguistics and Computer Science, University of Toronto, Toronto, Ontario
Alba Jorquera Jimenez de Aberasturi
Dept. of Linguistics, University of Toronto, Toronto, Canada
Paula Arkhangorodsky
Dept. of Linguistics, University of Toronto, Toronto, Canada
Robin Huo
Depts of Linguistics and Computer Science, University of Toronto, Toronto, Canada
Aman Sakhardande
Dept. of Linguistics, University of Toronto, Toronto, Canada
Mark Hallap
Dept. of Philosophy, University of Toronto, Toronto, Canada
Naomi Nagy
Dept. of Linguistics, University of Toronto, Toronto, Canada
Ewan Dunbar
University of Toronto
linguistics, phonology, statistics, computational modeling, language acquisition