🤖 AI Summary
This work addresses the lack of low-latency streaming solutions in Thai automatic speech recognition (ASR), where open systems predominantly rely on high-latency offline large models. The authors propose a 115M-parameter FastConformer-Transducer model trained on targets produced by a context-aware Thai text normalization pipeline, which makes transcriptions consistent, together with a two-stage curriculum learning strategy that adapts the model to the Isan dialect while preserving Central Thai performance. The model achieves accuracy comparable to Whisper Large-v3 at a 45x lower computational cost. The study also releases the Typhoon ASR Benchmark, a human-labeled evaluation set with standardized protocols, intended to address reproducibility problems in Thai ASR evaluation.
📝 Abstract
Large encoder-decoder models like Whisper achieve strong offline transcription but remain impractical for streaming applications due to high latency. However, because pre-trained checkpoints are readily accessible, the open Thai ASR landscape remains dominated by these offline architectures, leaving a critical gap in efficient streaming solutions. We present Typhoon ASR Real-time, a 115M-parameter FastConformer-Transducer model for low-latency Thai speech recognition. We demonstrate that rigorous text normalization can match the impact of model scaling: our compact model achieves a 45x reduction in computational cost compared to Whisper Large-v3 while delivering comparable accuracy. Our normalization pipeline resolves systemic ambiguities in Thai transcription, including context-dependent number verbalization and repetition markers (mai yamok), creating consistent training targets. We further introduce a two-stage curriculum learning approach for Isan (north-eastern) dialect adaptation that preserves Central Thai performance. To address reproducibility challenges in Thai ASR, we release the Typhoon ASR Benchmark, a gold-standard human-labeled dataset with transcriptions following established Thai linguistic conventions, providing standardized evaluation protocols for the research community.
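To make the mai yamok case concrete, the following is a minimal Python sketch of one normalization step: expanding the repetition mark "ๆ" (U+0E46) into an explicit repeat of the preceding word, so that the written target matches what is actually spoken. The function name `expand_mai_yamok` and the assumption that the input is already split into words are illustrative only. The abstract does not describe the authors' implementation, and a real pipeline would also need to handle phrase-level repetition and word segmentation.

```python
MAI_YAMOK = "\u0e46"  # Thai repetition mark "ๆ"


def expand_mai_yamok(tokens):
    """Expand mai yamok into an explicit repeat of the preceding word.

    Assumes `tokens` is a list of already-segmented Thai words
    (hypothetical input format, not taken from the paper).
    """
    out = []
    for tok in tokens:
        if tok == MAI_YAMOK and out:
            # Standalone mark: repeat the previous word.
            out.append(out[-1])
        elif tok.endswith(MAI_YAMOK) and len(tok) > 1:
            # Mark attached to its word, e.g. "เด็กๆ": strip it and repeat.
            base = tok[:-1]
            out.extend([base, base])
        else:
            # Ordinary word, or a leading mark with nothing to repeat.
            out.append(tok)
    return out
```

For example, `expand_mai_yamok(["เด็ก", "ๆ", "เล่น"])` yields the word "เด็ก" twice followed by "เล่น". Expanding the mark at the word level is the simplest choice; it does not cover cases where the mark repeats a multi-word phrase.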