CAMEL: Cross-Attention Enhanced Mixture-of-Experts and Language Bias for Code-Switching Speech Recognition

📅 2024-12-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address severe language confusion and insufficient cross-lingual representation fusion in Mandarin-English code-switching speech recognition, this paper proposes a Mixture-of-Experts (MoE) architecture enhanced with cross-attention and a language bias modeling mechanism. The method introduces cross-attention after each MoE layer to enable fine-grained alignment and fusion of language-specific representations; in addition, a source attention mechanism dynamically injects outputs from a language diarization (LD) decoder into text embeddings, explicitly modeling code-switching boundaries and contextual language dependencies. The model is trained end-to-end, jointly optimizing language diarization and ASR. Experiments on SEAME, ASRU200, and ASRU700+LibriSpeech460 demonstrate state-of-the-art performance, with notable gains in cross-lingual modeling and recognition robustness.
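To make the fusion idea concrete, here is a minimal PyTorch sketch of cross-attention between two language-specific expert streams after an MoE layer. All module and parameter names (`CrossAttentionMoEFusion`, `expert_zh`, `expert_en`, `d_model`, `n_heads`) are illustrative assumptions, not the authors' released implementation:

```python
# Hypothetical sketch: fuse Mandarin- and English-specific MoE outputs
# with cross-attention instead of weighted summation or concatenation.
import torch
import torch.nn as nn


class CrossAttentionMoEFusion(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # One language-specific encoder ("expert") per language.
        self.expert_zh = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.expert_en = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Cross-attention blocks: each language stream queries the other.
        self.zh_attends_en = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.en_attends_zh = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model) shared acoustic features.
        h_zh = self.expert_zh(x)  # Mandarin-specific representation
        h_en = self.expert_en(x)  # English-specific representation
        # Cross-lingual fusion: each stream attends to the other.
        zh_ctx, _ = self.zh_attends_en(h_zh, h_en, h_en)
        en_ctx, _ = self.en_attends_zh(h_en, h_zh, h_zh)
        # Residual combination of both cross-lingual contexts.
        return self.norm(x + zh_ctx + en_ctx)


fusion = CrossAttentionMoEFusion()
feats = torch.randn(2, 100, 256)  # dummy batch of frame features
print(fusion(feats).shape)  # torch.Size([2, 100, 256])
```

The design point is that each language stream queries the other, so the fusion weights vary per frame and per context rather than being a fixed scalar as in a weighted sum.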

📝 Abstract
Code-switching automatic speech recognition (ASR) aims to accurately transcribe speech that contains two or more languages. To better capture language-specific speech representations and address language confusion in code-switching ASR, the mixture-of-experts (MoE) architecture and an additional language diarization (LD) decoder are commonly employed. However, most research remains limited to simple operations such as weighted summation or concatenation to fuse language-specific speech representations, leaving significant room to explore richer integration of language bias information. In this paper, we introduce CAMEL, a cross-attention-based MoE and language bias approach for code-switching ASR. Specifically, after each MoE layer, we fuse language-specific speech representations with cross-attention, leveraging its strong contextual modeling abilities. Additionally, we design a source attention-based mechanism to incorporate the language information from the LD decoder output into text embeddings. Experimental results demonstrate that our approach achieves state-of-the-art performance on the SEAME, ASRU200, and ASRU700+LibriSpeech460 Mandarin-English code-switching ASR datasets.
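A hedged sketch of the source attention-based language bias idea: text embeddings serve as queries and the LD decoder's hidden states as keys and values, so each text token can attend to language-identity cues around switch points. Names and shapes here are assumptions for illustration, not the paper's actual code:

```python
# Hypothetical sketch: inject language-diarization (LD) information into
# text embeddings via source attention (text queries the LD stream).
import torch
import torch.nn as nn


class SourceAttentionLanguageBias(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.src_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_emb: torch.Tensor, ld_out: torch.Tensor) -> torch.Tensor:
        # text_emb: (batch, token_len, d_model) decoder text embeddings
        # ld_out:   (batch, token_len, d_model) LD decoder hidden states
        # Each text token queries the LD stream, pulling in per-token
        # language information (e.g., around code-switching boundaries).
        bias, _ = self.src_attn(query=text_emb, key=ld_out, value=ld_out)
        return self.norm(text_emb + bias)  # residual injection of the bias


layer = SourceAttentionLanguageBias()
txt = torch.randn(2, 20, 256)
ld = torch.randn(2, 20, 256)
print(layer(txt, ld).shape)  # torch.Size([2, 20, 256])
```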
Problem

Research questions and friction points this paper is trying to address.

Automatic Speech Recognition (ASR)
Code Switching
Multilingual Context Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Attention Enhanced Mixture-of-Experts (MoE) Structure
Source Attention-Based Language Bias Mechanism
End-to-End Code-Switching Automatic Speech Recognition
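The AI summary above notes that language diarization and ASR are jointly optimized end-to-end. A minimal multi-task loss sketch, assuming plain cross-entropy terms and an illustrative weight; the paper's actual objective (e.g., hybrid CTC/attention plus an LD loss) may differ:

```python
# Hypothetical sketch of a joint ASR + language-diarization objective.
import torch
import torch.nn.functional as F


def joint_loss(asr_logits, asr_targets, ld_logits, ld_targets, ld_weight=0.3):
    # asr_logits: (batch, len, vocab);   asr_targets: (batch, len) token ids
    # ld_logits:  (batch, len, n_langs); ld_targets:  (batch, len) language ids
    asr_loss = F.cross_entropy(asr_logits.transpose(1, 2), asr_targets)
    ld_loss = F.cross_entropy(ld_logits.transpose(1, 2), ld_targets)
    # Weighted sum lets the LD task act as an auxiliary language-bias signal.
    return (1.0 - ld_weight) * asr_loss + ld_weight * ld_loss


# Dummy example: 2 utterances, 20 tokens, 5000-token vocab, 3 language tags.
asr_logits = torch.randn(2, 20, 5000)
asr_targets = torch.randint(0, 5000, (2, 20))
ld_logits = torch.randn(2, 20, 3)
ld_targets = torch.randint(0, 3, (2, 20))
print(joint_loss(asr_logits, asr_targets, ld_logits, ld_targets))
```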
He Wang
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Xucheng Wan
IT Innovation and Research Center, Huawei Technologies, Shenzhen, China
Naijun Zheng
IT Innovation and Research Center, Huawei Technologies, Shenzhen, China
Kai Liu
IT Innovation and Research Center, Huawei Technologies, Shenzhen, China
Huan Zhou
IT Innovation and Research Center, Huawei Technologies, Shenzhen, China
Guojian Li
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Lei Xie
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China