Towards Unsupervised Speech Recognition Without Pronunciation Models

📅 2024-06-12

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Low-resource languages lack paired speech-text data, pronunciation dictionaries, and manually annotated word boundaries. Method: We propose a word-level, end-to-end unsupervised automatic speech recognition (ASR) framework that removes the reliance on phoneme modeling and external lexicons. The approach jointly performs masked token infilling (MTI) on the speech and text modalities, integrated with self-supervised contrastive learning and iterative refinement of the word segmentation structure. It requires no parallel corpora, no prior word boundary annotations, and no pronunciation model. Contribution/Results: Evaluated on a curated English speech corpus with a fixed vocabulary, the method achieves word error rates (WER) of 20–23%, depending on vocabulary size, and outperforms previous unsupervised ASR models in the lexicon-free setting. The results indicate that fully unsupervised, word-level ASR is feasible.
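To make the masked token infilling objective concrete, here is a minimal sketch of the corruption step only, not the paper's implementation. It assumes the input is already a sequence of discrete tokens (words for text, or hypothetical discrete unit IDs for speech). The `<mask>` sentinel, the `mask_spans` helper, and its parameters `mask_ratio` and `mean_span` are illustrative choices that do not come from the paper.

```python
import random

MASK = "<mask>"  # hypothetical sentinel token, chosen for illustration


def mask_spans(tokens, mask_ratio=0.3, mean_span=3, seed=0):
    """Corrupt a token sequence for an infilling objective.

    Randomly selects contiguous spans covering at most `mask_ratio` of the
    sequence. Each maximal run of masked positions is replaced by a single
    MASK token. Returns (corrupted, target); an infilling model would be
    trained to reconstruct `target` from `corrupted`.
    """
    rng = random.Random(seed)
    n = len(tokens)
    budget = int(round(n * mask_ratio))  # number of positions we may mask
    masked = [False] * n
    attempts = 0
    while budget > 0 and attempts < 10 * n:
        attempts += 1
        span = min(budget, rng.randint(1, 2 * mean_span - 1))
        start = rng.randrange(n)
        end = min(n, start + span)
        if any(masked[start:end]):
            continue  # skip spans that overlap an already masked region
        for i in range(start, end):
            masked[i] = True
        budget -= end - start

    corrupted = []
    for i, tok in enumerate(tokens):
        if not masked[i]:
            corrupted.append(tok)
        elif i == 0 or not masked[i - 1]:
            corrupted.append(MASK)  # one MASK per run of masked positions
    return corrupted, list(tokens)


words = "the cat sat on the mat".split()
corrupted, target = mask_spans(words)
print(corrupted, "->", target)
```

The same function could be applied to a list of speech unit IDs, which is one way to read "joint speech-to-speech and text-to-text" infilling; how the paper actually tokenizes speech is not specified in this summary.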

📝 Abstract

Recent advancements in supervised automatic speech recognition (ASR) have achieved remarkable performance, largely due to the growing availability of large transcribed speech corpora. However, most languages lack sufficient paired speech and text data to effectively train these systems. In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora, and we propose removing the reliance on a phoneme lexicon. We explore a new research direction, word-level unsupervised ASR, and experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling. Using a curated speech corpus containing a fixed number of English words, our system iteratively refines the word segmentation structure and achieves a word error rate between 20% and 23%, depending on the vocabulary size, without parallel transcripts, oracle word boundaries, or a pronunciation lexicon. The model outperforms previous unsupervised ASR models under the lexicon-free setting.
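For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions, insertions) between the recognizer output and the reference transcript, divided by the number of reference words. The sketch below is a generic implementation of that standard definition; the function name `word_error_rate` and the whitespace tokenization are assumptions for illustration, not the paper's evaluation script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("the cat sat on mat", "the cat sit on mat"))
```

Under this definition, a WER of 20% corresponds to roughly one word error for every five reference words.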

Problem

Research questions and friction points this paper is trying to address.

Speech Recognition

Resource-poor Languages

Training without Parallel Data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised Speech Recognition

Word-level Modeling

Masked Token Infilling

🔎 Similar Papers

Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision