AI Summary
To address degraded non-native speech recognition performance caused by language confusion in code-switched speech, this paper proposes a fine-grained language-aware enhancement method tailored to the Whisper architecture. The method comprises two key components: (1) an encoder refinement module that strengthens intra-sentence language-switch modeling; and (2) a dual-path language prompt embedding adapter with inter-layer feature fusion, enabling layer-wise language-specific representation learning and dynamic integration in the decoder. This work is the first to achieve fine-grained language discrimination and adaptive fusion in both the encoder and decoder of Whisper. Evaluated on the SEAME dataset, the proposed approach yields substantial relative reductions in mixed error rate (MER): 4.1% on dev_man and 7.2% on dev_sge, outperforming current state-of-the-art methods.
Abstract
Code-switching (CS) automatic speech recognition (ASR) faces challenges due to language confusion arising from accents, auditory similarity between languages, and seamless language switches. Adapting a pre-trained multilingual model has shown promising performance for CS-ASR. In this paper, we adapt Whisper, a large-scale multilingual pre-trained speech recognition model, to CS from both the encoder and decoder sides. First, we propose an encoder refiner to enhance the encoder's ability to model intra-sentence language switching. Second, we propose using two sets of language-aware adapters with different language prompt embeddings to obtain language-specific decoding information in each decoder layer. A fusion module is then added to merge the language-aware decoding information. Experimental results on the SEAME dataset show that, compared with the baseline model, the proposed approach achieves relative MER reductions of 4.1% and 7.2% on the dev_man and dev_sge test sets, respectively, surpassing state-of-the-art methods. Through experiments, we found that the proposed method significantly improves performance on the non-native language in CS speech, indicating that our approach enables Whisper to better distinguish between the two languages.
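The dual-path language-aware adaptation with fusion described in the abstract can be sketched as follows. This is a toy NumPy sketch under stated assumptions, not the paper's implementation: the bottleneck-adapter form, the way the language prompt embedding conditions each path, the sigmoid gate in the fusion module, and all dimensions and weights are hypothetical illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_bottleneck = 8, 4  # toy sizes; the paper's dimensions are not given

def language_adapter(h, W_down, W_up, prompt):
    """Bottleneck adapter conditioned on a language prompt embedding (hypothetical form):
    project down, add the prompt embedding, apply ReLU, project back up with a residual."""
    z = np.maximum(h @ W_down + prompt, 0.0)
    return h + z @ W_up

# Hypothetical parameters for the two language-specific paths (e.g. Mandarin / English).
W_down_man = rng.normal(size=(d_model, d_bottleneck))
W_up_man = rng.normal(size=(d_bottleneck, d_model))
W_down_eng = rng.normal(size=(d_model, d_bottleneck))
W_up_eng = rng.normal(size=(d_bottleneck, d_model))
prompt_man = rng.normal(size=d_bottleneck)  # language prompt embeddings
prompt_eng = rng.normal(size=d_bottleneck)
W_gate = rng.normal(size=(2 * d_model, 1))  # fusion-gate parameters (hypothetical)

def fuse(h):
    """Run both language-aware paths on a decoder hidden state, then
    combine them with a scalar sigmoid gate computed from both outputs."""
    h_man = language_adapter(h, W_down_man, W_up_man, prompt_man)
    h_eng = language_adapter(h, W_down_eng, W_up_eng, prompt_eng)
    g = 1.0 / (1.0 + np.exp(-np.concatenate([h_man, h_eng]) @ W_gate))  # sigmoid
    return g * h_man + (1.0 - g) * h_eng

h = rng.normal(size=d_model)  # one decoder-layer hidden state for a single frame
out = fuse(h)
print(out.shape)  # (8,) — same shape as the input hidden state
```

In a real decoder this per-layer fusion would let each layer weight the Mandarin and English representations differently at each token, which is the "dynamic integration" the summary refers to; here the gate is random only because no trained weights are available.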