🤖 AI Summary
Multilingual ASR performance degrades because of substantial linguistic diversity and imbalanced training data. Routing speech with language identification (LID) models adds two further problems: high operational costs from invoking commercial state-of-the-art (SOTA) ASR models, and errors caused by misclassification. To address these challenges, the authors propose an adaptive selective invocation mechanism built on a spoken large language model (SLLM). The method assesses the difficulty of the input speech and invokes a SOTA ASR model only when necessary. It uses the SLLM both to assess difficulty and to transcribe simple inputs directly, which lets accuracy and efficiency be optimized together. Experiments on three datasets show an 18.7% relative reduction in word error rate over a pure-SLLM baseline and a 50% decrease in invocation cost compared to conventional LID-based routing, improving the cost-effectiveness and scalability of multilingual ASR systems.
📝 Abstract
Although multilingual automatic speech recognition (ASR) systems have advanced significantly, enabling a single model to handle multiple languages, inherent linguistic differences and data imbalances make it difficult to achieve state-of-the-art (SOTA) performance across all languages. Language identification (LID) models can route speech to the appropriate ASR model, but they incur high costs from invoking SOTA commercial models and suffer from inaccuracies due to misclassification. To overcome these limitations, we propose SIMA, a selective invocation approach for multilingual ASR that adapts to the difficulty level of the input speech. Built on a spoken large language model (SLLM), SIMA evaluates whether the input is simple enough for direct transcription or requires the invocation of a SOTA ASR model. Our approach reduces word error rates by 18.7% compared to the SLLM alone and halves invocation costs compared to LID-based methods. Tests on three datasets show that SIMA is a scalable, cost-effective solution for multilingual ASR applications.
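
The selective invocation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `selective_transcribe`, `is_simple`, `sllm_transcribe`, `sota_transcribe`, and `invocation_rate` are hypothetical, and the difficulty judgment and both transcribers are passed in as plain callables because the abstract does not specify how the SLLM produces them.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RoutedTranscript:
    text: str
    used_sota: bool  # True when the expensive SOTA ASR model was invoked


def selective_transcribe(
    audio: bytes,
    is_simple: Callable[[bytes], bool],       # hypothetical: SLLM difficulty judgment
    sllm_transcribe: Callable[[bytes], str],  # hypothetical: direct SLLM transcription
    sota_transcribe: Callable[[bytes], str],  # hypothetical: call to a SOTA ASR model
) -> RoutedTranscript:
    """Transcribe directly when the input is judged simple, else invoke the SOTA model."""
    if is_simple(audio):
        return RoutedTranscript(sllm_transcribe(audio), used_sota=False)
    return RoutedTranscript(sota_transcribe(audio), used_sota=True)


def invocation_rate(results: list[RoutedTranscript]) -> float:
    """Fraction of utterances that triggered the SOTA model (a rough proxy for cost)."""
    if not results:
        return 0.0
    return sum(r.used_sota for r in results) / len(results)
```

Under this sketch, invocation cost scales with the fraction of utterances judged difficult, whereas an LID-based router would call a SOTA model for every utterance. How that fraction maps to the reported halving of cost depends on details the abstract does not give.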