Flavors of Moonshine: Tiny Specialized ASR Models for Edge Devices

📅 2025-09-02

📈 Citations: 0

✨ Influential: 0

career value

199K/year

🤖 AI Summary

To address the scarcity of high-quality automatic speech recognition (ASR) models for under-resourced languages on resource-constrained edge devices, this paper proposes a monolingual, ultra-lightweight ASR modeling paradigm. Departing from conventional multilingual joint modeling, our approach employs a compact neural architecture with only 27 million parameters and trains it exclusively on monolingual data—comprising high-fidelity human annotations, trusted pseudo-labels, and controllably synthesized speech. Experiments across six low-resource languages demonstrate that our model achieves an average word error rate (WER) 48% lower than Whisper Tiny, significantly outperforms Whisper Small (which has 9× more parameters), and matches or approaches the performance of Whisper Medium (28× larger) on most languages. All code, models, and data are fully open-sourced. This work establishes a practical, efficient, and scalable paradigm for deploying accurate ASR on edge devices for low-resource languages.

Technology Category

Application Category

📝 Abstract

We present the Flavors of Moonshine, a suite of tiny automatic speech recognition (ASR) models specialized for a range of underrepresented languages. Prevailing wisdom suggests that multilingual ASR models outperform monolingual counterparts by exploiting cross-lingual phonetic similarities. We challenge this assumption, showing that for sufficiently small models (27M parameters), training monolingual systems on a carefully balanced mix of high-quality human-labeled, pseudo-labeled, and synthetic data yields substantially superior performance. On average, our models achieve error rates 48% lower than the comparably sized Whisper Tiny model, outperform the 9x larger Whisper Small model, and in most cases match or outperform the 28x larger Whisper Medium model. These results advance the state of the art for models of this size, enabling accurate on-device ASR for languages that previously had limited support. We release Arabic, Chinese, Japanese, Korean, Ukrainian, and Vietnamese Moonshine models under a permissive open-source license.

Problem

Research questions and friction points this paper is trying to address.

Develop tiny ASR models for underrepresented languages

Challenge multilingual superiority in small model scenarios

Enable accurate on-device speech recognition with limited support

Innovation

Methods, ideas, or system contributions that make the work stand out.

Monolingual training with balanced data mix

Small 27M parameter ASR models

Superior performance over larger multilingual models

🔎 Similar Papers

No similar papers found.

TikTok

San Jose, California

Software Engineer Intern (AI Model Optimization) - 2026 Summer (BS/MS)

TikTok

Seattle, Washington

Authors to Follow