🤖 AI Summary
To address unsupervised automatic speech recognition (UASR) for low-resource languages lacking paired speech-text data, this paper proposes the first syllable-level UASR framework. It abandons conventional phoneme modeling, which relies on grapheme-to-phoneme (G2P) converters, and avoids the training instability of generative adversarial networks (GANs), thereby improving generalization to languages with ambiguous phoneme boundaries. Methodologically, the approach jointly leverages self-supervised speech representations and unsupervised text learning, achieving cross-modal syllable-level alignment via masked language modeling (MLM) and enabling end-to-end training. On LibriSpeech, it achieves up to a 40% relative reduction in character error rate over prior UASR methods and, crucially, demonstrates the first successful transfer to Mandarin speech recognition, substantially outperforming existing UASR approaches. The core contributions are (i) a novel syllable-level modeling paradigm and (ii) a scalable cross-modal alignment mechanism applicable to unpaired multimodal data.
📝 Abstract
Training speech recognizers with unpaired speech and text -- known as unsupervised speech recognition (UASR) -- is a crucial step toward extending ASR to low-resource languages in the long-tail distribution and enabling multimodal learning from non-parallel data. However, existing approaches based on phones often rely on costly resources such as grapheme-to-phoneme converters (G2Ps) and struggle to generalize to languages with ambiguous phoneme boundaries due to training instability. In this paper, we address both challenges by introducing a syllable-level UASR framework based on masked language modeling, which avoids the need for G2P and the instability of GAN-based methods. Our approach achieves up to a 40% relative reduction in character error rate (CER) on LibriSpeech and generalizes effectively to Mandarin, a language that has remained particularly difficult for prior methods. Code will be released upon acceptance.
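To make the masked-language-modeling objective mentioned above concrete, here is a minimal, illustrative sketch of the token-masking step that an MLM objective over syllable tokens would require. All names (`mask_syllables`, `MASK_ID`) and the masking rate are assumptions for illustration, not the paper's actual configuration:

```python
import random

MASK_ID = 0  # hypothetical reserved ID for the [MASK] token


def mask_syllables(syllable_ids, mask_prob=0.15, seed=0):
    """Randomly replace a fraction of syllable tokens with MASK_ID.

    Returns the corrupted sequence and a target list in which unmasked
    positions are None (i.e., ignored by the reconstruction loss), so
    the model is trained to predict only the masked syllables.
    """
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in syllable_ids:
        if rng.random() < mask_prob:
            corrupted.append(MASK_ID)  # hide this syllable from the model
            targets.append(tok)        # the model must reconstruct it
        else:
            corrupted.append(tok)
            targets.append(None)       # no loss at this position
    return corrupted, targets
```

In an MLM-based UASR setup, a predictor trained on unpaired text to fill in such masked positions can then score candidate syllable sequences decoded from speech, which is one way the cross-modal alignment described in the abstract could be driven without paired data.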