🤖 AI Summary
Over 7,000 languages are spoken worldwide, yet most lack automatic speech recognition (ASR) support, especially low-resource, long-tail languages, due to limitations in architectural scalability, prohibitive data acquisition costs, and ethical risks. Method: We propose the first large-scale multilingual ASR architecture designed for extensibility and zero-shot generalization. It scales self-supervised pre-training to a 7B-parameter speech encoder and pairs it with an LLM-inspired decoder in an encoder-decoder architecture, enabling robust cross-lingual generalization from minimal speech data. Contribution/Results: The system supports 1,600+ languages, including over 500 previously unsupported ones, using publicly available data and community-sourced multilingual speech corpora. Experiments demonstrate substantial gains over state-of-the-art methods in ultra-low-resource settings. We open-source a family of models ranging from 300M to 7B parameters, covering both edge-device deployment and high-accuracy applications.
📝 Abstract
Automatic speech recognition (ASR) has advanced in high-resource languages, but most of the world's 7,000+ languages remain unsupported, leaving thousands of long-tail languages behind. Expanding ASR coverage has been costly and limited by architectures that restrict language support, making extension inaccessible to most, all while entangled with ethical concerns when pursued without community collaboration. To transcend these limitations, we introduce Omnilingual ASR, the first large-scale ASR system designed for extensibility. Omnilingual ASR enables communities to introduce unserved languages with only a handful of data samples. It scales self-supervised pre-training to 7B parameters to learn robust speech representations and introduces an encoder-decoder architecture designed for zero-shot generalization, leveraging an LLM-inspired decoder. This capability is grounded in a massive and diverse training corpus; by combining breadth of coverage with linguistic variety, the model learns representations robust enough to adapt to unseen languages. Incorporating public resources with community-sourced recordings gathered through compensated local partnerships, Omnilingual ASR expands coverage to over 1,600 languages, the largest such effort to date, including over 500 never before served by ASR. Automatic evaluations show substantial gains over prior systems, especially in low-resource conditions, and strong generalization. We release Omnilingual ASR as a family of models, from 300M variants for low-power devices to 7B for maximum accuracy. We reflect on the ethical considerations shaping this design and conclude by discussing its societal impact. In particular, we highlight how open-sourcing models and tools can lower barriers for researchers and communities, inviting new forms of participation. Open-source artifacts are available at https://github.com/facebookresearch/omnilingual-asr.