🤖 AI Summary
This work addresses the challenge of building an efficient, low-cost automatic speech recognition (ASR) system for Singapore’s multilingual setting, which spans English, Mandarin, Malay, and Tamil. It proposes a balanced-sampling fine-tuning strategy that operates without explicit language labels, so that the Qwen3-ASR-0.6B/1.7B models are trained end to end to perform implicit language identification and transcription jointly. The resulting compact multilingual ASR system, Polyglot-Lion-1.7B, achieves an average error rate of 14.85% across 12 benchmarks, at a training cost of \$81 on a single GPU and an inference speed of 0.10 seconds per sample, approximately 20 times faster than MERaLiON. Despite its much smaller model size and computational overhead, the system approaches the performance of large, specialized ASR systems.
📝 Abstract
We present Polyglot-Lion, a family of compact multilingual automatic speech recognition (ASR) models tailored for the linguistic landscape of Singapore, covering English, Mandarin, Tamil, and Malay. Our models are obtained by fine-tuning Qwen3-ASR-0.6B and Qwen3-ASR-1.7B exclusively on publicly available speech corpora, using a balanced sampling strategy that equalizes the number of training utterances per language and deliberately omits language-tag conditioning so that the model learns to identify languages implicitly from audio. On 12 benchmarks spanning the four target languages, Polyglot-Lion-1.7B achieves an average error rate of 14.85, competitive with MERaLiON-2-10B-ASR (14.32), a model roughly 6x larger, while incurring a training cost of \$81 on a single RTX PRO 6000 GPU compared to \$18,862 for the 128-GPU baseline. Inference is approximately 20x faster than MERaLiON, at 0.10 s/sample versus 2.02 s/sample. These results demonstrate that linguistically balanced fine-tuning of moderate-scale pretrained models can yield deployment-ready multilingual ASR at a fraction of the cost of larger specialist systems.
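The balanced sampling idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract only states that the number of training utterances is equalized per language and that language tags are withheld from the model. The sketch assumes equalization by downsampling every language to the size of the smallest pool (upsampling would be an equally plausible reading), and the function name `balanced_sample`, the tuple layout, and the toy file names are hypothetical.

```python
import random
from collections import defaultdict


def balanced_sample(utterances, seed=0):
    """Return an equal number of utterances per language, without language tags.

    `utterances` is a list of (audio_path, transcript, language) tuples.
    Assumption: balance by downsampling each language to the smallest pool.
    The language field is used only for balancing and is dropped from the
    output, so a model trained on the result never sees an explicit tag.
    """
    rng = random.Random(seed)

    # Group examples by language, keeping only audio path and transcript.
    by_lang = defaultdict(list)
    for audio_path, transcript, lang in utterances:
        by_lang[lang].append((audio_path, transcript))

    # Every language contributes as many utterances as the smallest pool has.
    per_lang = min(len(pool) for pool in by_lang.values())

    sampled = []
    for lang in sorted(by_lang):  # sorted for reproducibility
        sampled.extend(rng.sample(by_lang[lang], per_lang))

    # Shuffle so languages are interleaved in the final training order.
    rng.shuffle(sampled)
    return sampled


# Toy corpus with deliberately unequal language sizes (hypothetical data).
corpus = (
    [(f"en_{i}.wav", "hello", "en") for i in range(5)]
    + [(f"zh_{i}.wav", "你好", "zh") for i in range(3)]
    + [(f"ms_{i}.wav", "selamat pagi", "ms") for i in range(2)]
    + [(f"ta_{i}.wav", "வணக்கம்", "ta") for i in range(4)]
)

train_set = balanced_sample(corpus)
print(len(train_set))
```

In the toy corpus the smallest language has two utterances, so each of the four languages contributes two examples and the training set holds eight (audio, transcript) pairs with no language field. Downsampling discards data from the larger corpora; whether the paper trades that off against repetition from upsampling is not stated in the abstract.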