🤖 AI Summary
To address unsupervised automatic speech recognition (UASR) for low-resource languages lacking paired speech-text data, this paper proposes the first syllable-level UASR framework. It abandons conventional phoneme modeling, which relies on grapheme-to-phoneme (G2P) converters, and avoids the training instability of generative adversarial networks (GANs), thereby improving generalization to languages with ambiguous phoneme boundaries. Methodologically, the approach jointly leverages self-supervised speech representations and unsupervised text learning, achieving cross-modal syllable-level alignment via masked language modeling (MLM) and enabling end-to-end training. On LibriSpeech, it achieves up to a 40% relative reduction in character error rate over prior UASR methods and, crucially, demonstrates the first successful transfer to Mandarin speech recognition, substantially outperforming existing UASR approaches. The core contributions are (i) a novel syllable-level modeling paradigm and (ii) a scalable cross-modal alignment mechanism applicable to unpaired multimodal data.
📝 Abstract
Training speech recognizers with unpaired speech and text -- known as unsupervised speech recognition (UASR) -- is a crucial step toward extending ASR to low-resource languages in the long-tail distribution and enabling multimodal learning from non-parallel data. However, existing approaches based on phones often rely on costly resources such as grapheme-to-phoneme converters (G2Ps) and struggle to generalize to languages with ambiguous phoneme boundaries due to training instability. In this paper, we address both challenges by introducing a syllable-level UASR framework based on masked language modeling, which avoids the need for G2P and the instability of GAN-based methods. Our approach achieves up to a 40% relative reduction in character error rate (CER) on LibriSpeech and generalizes effectively to Mandarin, a language that has remained particularly difficult for prior methods. Code will be released upon acceptance.
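To make the masked-language-modeling objective mentioned above concrete, here is a minimal, illustrative sketch of the token-masking step that an MLM objective over syllable tokens would require. All names (`mask_syllables`, `MASK_ID`) and the masking rate are assumptions for illustration, not the paper's actual configuration:

```python
import random

MASK_ID = 0  # hypothetical reserved ID for the [MASK] token


def mask_syllables(syllable_ids, mask_prob=0.15, seed=0):
    """Randomly replace a fraction of syllable tokens with MASK_ID.

    Returns the corrupted sequence and a target list in which unmasked
    positions are None (i.e., ignored by the reconstruction loss), so
    the model is trained to predict only the masked syllables.
    """
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in syllable_ids:
        if rng.random() < mask_prob:
            corrupted.append(MASK_ID)  # hide this syllable from the model
            targets.append(tok)        # the model must reconstruct it
        else:
            corrupted.append(tok)
            targets.append(None)       # no loss at this position
    return corrupted, targets
```

In an MLM-based UASR setup, a predictor trained on unpaired text to fill in such masked positions can then score candidate syllable sequences decoded from speech, which is one way the cross-modal alignment described in the abstract could be driven without paired data.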