Selective Invocation for Multilingual ASR: A Cost-effective Approach Adapting to Speech Recognition Difficulty

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multilingual ASR suffers from performance degradation due to substantial linguistic diversity and imbalanced training data, as well as high operational costs and misclassification errors induced by reliance on commercial language identification (LID) models. To address these challenges, the authors propose an adaptive selective invocation mechanism grounded in spoken large language models (SLLMs). The method introduces a speech-difficulty-aware dynamic decision paradigm that invokes state-of-the-art (SOTA) ASR models only when necessary. It is the first to jointly leverage SLLMs for both difficulty assessment and lightweight end-to-end transcription, enabling co-optimization of accuracy and efficiency. The approach integrates multilingual speech feature modeling with a low-overhead inference design. Experiments across three multilingual benchmarks demonstrate an 18.7% relative reduction in word error rate over a pure-SLLM baseline and a 50% decrease in invocation cost compared to conventional LID-based routing, significantly enhancing the cost-effectiveness and scalability of multilingual ASR systems.

📝 Abstract
Although multilingual automatic speech recognition (ASR) systems have significantly advanced, enabling a single model to handle multiple languages, inherent linguistic differences and data imbalances challenge SOTA performance across all languages. While language identification (LID) models can route speech to the appropriate ASR model, they incur high costs from invoking SOTA commercial models and suffer from inaccuracies due to misclassification. To overcome these challenges, we propose SIMA, a selective-invocation framework for multilingual ASR that adapts to the difficulty level of the input speech. Built on a spoken large language model (SLLM), SIMA evaluates whether the input is simple enough for direct transcription or requires the invocation of a SOTA ASR model. Our approach reduces word error rates by 18.7% compared to the SLLM alone and halves invocation costs compared to LID-based methods. Tests on three datasets show that SIMA is a scalable, cost-effective solution for multilingual ASR applications.
Problem

Research questions and friction points this paper is trying to address.

Reducing multilingual ASR costs by selective model invocation
Improving accuracy by adapting to speech difficulty levels
Overcoming language imbalances and misclassification in ASR
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses SLLM to assess speech difficulty
Selectively invokes SOTA ASR when needed
Reduces costs and improves accuracy
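The routing idea above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's actual implementation: the function names (`sllm_difficulty`, `sllm_transcribe`, `sota_transcribe`) and the fixed difficulty threshold are assumptions for illustration; the paper's own difficulty assessment is performed by the SLLM itself.

```python
# Hypothetical sketch of SIMA-style selective invocation.
# Easy utterances are transcribed directly by the lightweight SLLM;
# hard ones are routed to the costly SOTA ASR model.

from dataclasses import dataclass
from typing import Callable


@dataclass
class RoutingResult:
    text: str        # final transcription
    used_sota: bool  # whether the expensive SOTA model was invoked


def selective_invoke(
    audio: bytes,
    sllm_difficulty: Callable[[bytes], float],  # difficulty score in [0, 1]
    sllm_transcribe: Callable[[bytes], str],    # lightweight SLLM decoding
    sota_transcribe: Callable[[bytes], str],    # costly SOTA ASR call
    threshold: float = 0.5,                     # assumed cut-off, tunable
) -> RoutingResult:
    """Invoke the SOTA ASR model only when the SLLM judges the input hard."""
    if sllm_difficulty(audio) > threshold:
        return RoutingResult(sota_transcribe(audio), used_sota=True)
    return RoutingResult(sllm_transcribe(audio), used_sota=False)
```

The cost saving comes from the second branch: whenever the difficulty score stays below the threshold, the SOTA model is never called, so its invocation cost is avoided entirely for easy inputs.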
Hongfei Xue
Audio, Speech and Language Processing Group (ASLP@NPU), School of Software, Northwestern Polytechnical University, China
Yufeng Tang
ByteDance, China
Jun Zhang
ByteDance, China
Xuelong Geng
School of Computer Science, Northwestern Polytechnical University
Lei Xie
Audio, Speech and Language Processing Group (ASLP@NPU), School of Software, Northwestern Polytechnical University, China