Selective Invocation for Multilingual ASR: A Cost-effective Approach Adapting to Speech Recognition Difficulty

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multilingual ASR suffers from performance degradation due to substantial linguistic diversity and imbalanced training data, as well as high operational costs and misclassification errors induced by reliance on commercial language identification (LID) models. To address these challenges, the authors propose an adaptive selective invocation mechanism grounded in spoken large language models (SLLMs). The method introduces a speech-difficulty-aware dynamic decision paradigm that invokes state-of-the-art (SOTA) ASR models only when necessary. It is the first to jointly leverage SLLMs for both difficulty assessment and lightweight end-to-end transcription, enabling co-optimization of accuracy and efficiency. The approach integrates multilingual speech feature modeling with a low-overhead inference design. Experiments across three multilingual benchmarks demonstrate an 18.7% relative reduction in word error rate over a pure-SLLM baseline and a 50% decrease in invocation cost compared to conventional LID-based routing, significantly enhancing the cost-effectiveness and scalability of multilingual ASR systems.

📝 Abstract
Although multilingual automatic speech recognition (ASR) systems have significantly advanced, enabling a single model to handle multiple languages, inherent linguistic differences and data imbalances challenge SOTA performance across all languages. While language identification (LID) models can route speech to the appropriate ASR model, they incur high costs from invoking SOTA commercial models and suffer from inaccuracies due to misclassification. To overcome these challenges, we propose SIMA, a selective-invocation framework for multilingual ASR that adapts to the difficulty level of the input speech. Built on a spoken large language model (SLLM), SIMA evaluates whether the input is simple enough for direct transcription or requires the invocation of a SOTA ASR model. Our approach reduces word error rates by 18.7% compared to the SLLM alone and halves invocation costs compared to LID-based methods. Tests on three datasets show that SIMA is a scalable, cost-effective solution for multilingual ASR applications.
Problem

Research questions and friction points this paper is trying to address.

Reducing multilingual ASR costs by selective model invocation
Improving accuracy by adapting to speech difficulty levels
Overcoming language imbalances and misclassification in ASR
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses SLLM to assess speech difficulty
Selectively invokes SOTA ASR when needed
Reduces costs and improves accuracy
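The routing idea above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's actual implementation: the function names (`sllm_difficulty`, `sllm_transcribe`, `sota_transcribe`) and the fixed difficulty threshold are assumptions for illustration; the paper's own difficulty assessment is performed by the SLLM itself.

```python
# Hypothetical sketch of SIMA-style selective invocation.
# Easy utterances are transcribed directly by the lightweight SLLM;
# hard ones are routed to the costly SOTA ASR model.

from dataclasses import dataclass
from typing import Callable


@dataclass
class RoutingResult:
    text: str        # final transcription
    used_sota: bool  # whether the expensive SOTA model was invoked


def selective_invoke(
    audio: bytes,
    sllm_difficulty: Callable[[bytes], float],  # difficulty score in [0, 1]
    sllm_transcribe: Callable[[bytes], str],    # lightweight SLLM decoding
    sota_transcribe: Callable[[bytes], str],    # costly SOTA ASR call
    threshold: float = 0.5,                     # assumed cut-off, tunable
) -> RoutingResult:
    """Invoke the SOTA ASR model only when the SLLM judges the input hard."""
    if sllm_difficulty(audio) > threshold:
        return RoutingResult(sota_transcribe(audio), used_sota=True)
    return RoutingResult(sllm_transcribe(audio), used_sota=False)
```

The cost saving comes from the second branch: whenever the difficulty score stays below the threshold, the SOTA model is never called, so its invocation cost is avoided entirely for easy inputs.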
Hongfei Xue
Audio, Speech and Language Processing Group (ASLP@NPU), School of Software, Northwestern Polytechnical University, China
Yufeng Tang
ByteDance, China
Jun Zhang
ByteDance, China
Xuelong Geng
School of Computer Science, Northwestern Polytechnical University
Lei Xie
Audio, Speech and Language Processing Group (ASLP@NPU), School of Software, Northwestern Polytechnical University, China