🤖 AI Summary
To address scarce speech data, complex morphology, and high out-of-vocabulary (OOV) rates in the endangered language SENĆOTEN, all of which severely limit automatic speech recognition (ASR) performance, this paper proposes an end-to-end speech transcription framework adapted to low-resource settings. The method combines cross-lingual transfer learning from speech foundation models, data augmentation with text-to-speech (TTS) synthesis, and an n-gram language model applied either through shallow fusion or through n-best rescoring. Together, these components target OOV errors and the vocabulary variation produced by the language's morphology. On the SENĆOTEN test set, which has a 57.02% OOV rate, the system achieves a word error rate (WER) of 19.34% and a character error rate (CER) of 5.09%. After minor cedilla-related errors are filtered out, WER falls to 14.32% (26.48% on unseen words) and CER to 3.45%. This work establishes a practical ASR pipeline for SENĆOTEN and provides a modeling paradigm that could transfer to other low-resource endangered languages, supporting digital archiving, the creation of teaching materials, and language revitalization.
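As a rough illustration of how an n-gram language model can re-rank ASR output, the sketch below rescores an n-best list with a log-linear combination of acoustic and LM scores. It is a minimal sketch under stated assumptions, not the authors' implementation: the add-one-smoothed bigram model, the weights `lm_weight` and `length_bonus`, and all helper names are hypothetical, because the paper's LM order, smoothing, and tuned weights are not given here. Shallow fusion uses the same weighted LM term, but adds it at every beam-search step instead of once per complete hypothesis.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams, padding each sentence with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_logprob(sentence, unigrams, bigrams):
    """Natural-log probability under an add-one (Laplace) smoothed bigram LM.
    Hypothetical choice: the paper's n-gram order and smoothing are not stated."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
        for prev, word in zip(tokens, tokens[1:])
    )


def rescore_nbest(nbest, unigrams, bigrams, lm_weight=0.5, length_bonus=0.0):
    """Pick the (text, am_score) pair maximizing
    am_score + lm_weight * lm_score + length_bonus * num_words.
    The weights are hypothetical and would normally be tuned on dev data."""
    def combined(hyp):
        text, am_score = hyp
        lm_score = bigram_logprob(text, unigrams, bigrams)
        return am_score + lm_weight * lm_score + length_bonus * len(text.split())
    return max(nbest, key=combined)


if __name__ == "__main__":
    corpus = ["a b c", "a b d", "a b c"]  # placeholder tokens, not real data
    uni, bi = train_bigram(corpus)
    nbest = [("a c b", -1.0), ("a b c", -1.2)]  # (hypothesis, acoustic log-score)
    print(rescore_nbest(nbest, uni, bi, lm_weight=0.0))  # ('a c b', -1.0)
    print(rescore_nbest(nbest, uni, bi, lm_weight=1.0))  # ('a b c', -1.2)
```

With `lm_weight=0.0` the acoustically best hypothesis wins; raising the weight lets the LM favor word sequences attested in the training text, which is how the available text data can be put to additional use.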
📝 Abstract
The SENĆOTEN language, spoken on the Saanich Peninsula of southern Vancouver Island, is the focus of vigorous revitalization efforts aimed at reversing the language loss caused by colonial language policies. To support these on-the-ground efforts, the community is turning to digital technology. Automatic Speech Recognition (ASR) technology holds great promise for accelerating language documentation and the creation of educational resources. However, developing ASR systems for SENĆOTEN is challenging because data are limited and vocabulary varies substantially as a result of the language's polysynthetic structure and stress-driven metathesis. To address these challenges, we propose an ASR-driven documentation pipeline that leverages augmented speech data from a text-to-speech (TTS) system and cross-lingual transfer learning with Speech Foundation Models (SFMs). An n-gram language model is also incorporated via shallow fusion or n-best rescoring to make the most of the available data. Experiments on the SENĆOTEN dataset show a word error rate (WER) of 19.34% and a character error rate (CER) of 5.09% on a test set with a 57.02% out-of-vocabulary (OOV) rate. After filtering minor cedilla-related errors, WER improves to 14.32% (26.48% on unseen words) and CER to 3.45%, demonstrating the potential of our ASR-driven pipeline to support SENĆOTEN language documentation.
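The reported metrics can be made concrete with a small scoring sketch. Here WER and CER are Levenshtein edit distances over words and characters, normalized by total reference length; CER counts spaces as characters, which is only one of several common conventions. The OOV rate is computed over test word tokens against a training vocabulary (token-level rather than type-level is an assumption). The abstract does not say how "minor cedilla-related errors" were filtered, so `strip_cedilla` is a hypothetical normalization that removes the combining cedilla (U+0327) from both reference and hypothesis before scoring.

```python
import unicodedata

COMBINING_CEDILLA = "\u0327"


def strip_cedilla(text):
    """Hypothetical normalization (not the authors' stated method):
    decompose, drop combining cedillas, then recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace(COMBINING_CEDILLA, ""))


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def wer(refs, hyps):
    """Word error rate: total word edits / total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    return errors / sum(len(r.split()) for r in refs)


def cer(refs, hyps):
    """Character error rate; spaces are counted as characters here."""
    errors = sum(edit_distance(list(r), list(h)) for r, h in zip(refs, hyps))
    return errors / sum(len(r) for r in refs)


def oov_rate(test_refs, train_vocab):
    """Share of test word tokens not seen in the training vocabulary."""
    tokens = [w for r in test_refs for w in r.split()]
    return sum(w not in train_vocab for w in tokens) / len(tokens)
```

Applying `strip_cedilla` to both sides before calling `wer` or `cer` turns a cedilla-only mismatch such as Ķ versus K into a match, which is one plausible reading of how the filtered scores could be obtained.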