Improving Zero-Shot Chinese-English Code-Switching ASR with kNN-CTC and Gated Monolingual Datastores

📅 2024-06-06
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Zero-shot Mandarin-English code-switching ASR suffers from cross-lingual interference and from the difficulty of modeling language switches without code-switched training data. This paper proposes a kNN-CTC framework with dual monolingual datastores and a gated selection mechanism. The gate routes each frame to the datastore of the matching language, so retrieval injects language-specific context; the datastores are indexed with monolingual embeddings under CTC alignment constraints, which removes the need for mixed-language data. Unlike designs built on a single bilingual datastore, the gate keeps the two languages' representations separate and thereby reduces cross-lingual interference. Experiments show that the approach outperforms the baselines on zero-shot code-switching ASR without any Mandarin-English mixed-language training data.
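As a rough illustration of indexing a monolingual datastore under a CTC alignment, the sketch below stores one (frame embedding, CTC label) pair per non-blank frame. The function name `build_datastore`, the blank index, and the choice to skip blank frames are assumptions made for this example; the summary does not spell out these details.

```python
import numpy as np

BLANK_ID = 0  # assumed index of the CTC blank token


def build_datastore(frame_embeddings, ctc_logits):
    """Return (keys, values) for one utterance.

    keys   : embeddings of frames whose CTC argmax label is not blank
    values : the corresponding CTC pseudo-labels
    """
    labels = ctc_logits.argmax(axis=-1)
    keep = labels != BLANK_ID  # drop blank frames (assumed strategy)
    return frame_embeddings[keep], labels[keep]


# Toy utterance: 5 frames, 4-dim embeddings, vocabulary of 3 tokens.
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 4))
logits = np.full((5, 3), -5.0)
for t, tok in enumerate([0, 2, 2, 0, 1]):  # frame-level CTC argmax path
    logits[t, tok] = 5.0

keys, values = build_datastore(emb, logits)
```

Running this once per language on monolingual audio would yield two separate datastores, one for Mandarin and one for English, with no code-switched data involved.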

📝 Abstract
The kNN-CTC model has proven to be effective for monolingual automatic speech recognition (ASR). However, its direct application to multilingual scenarios like code-switching presents challenges. Although there is potential for performance improvement, a kNN-CTC model utilizing a single bilingual datastore can inadvertently introduce undesirable noise from the alternative language. To address this, we propose a novel kNN-CTC-based code-switching ASR (CS-ASR) framework that employs dual monolingual datastores and a gated datastore selection mechanism to reduce noise interference. Our method selects the appropriate datastore for decoding each frame, ensuring the injection of language-specific information into the ASR process. We apply this framework to cutting-edge CTC-based models, developing an advanced CS-ASR system. Extensive experiments demonstrate the remarkable effectiveness of our gated datastore mechanism in enhancing the performance of zero-shot Chinese-English CS-ASR.
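The per-frame datastore selection described in the abstract can be sketched as follows. This is a minimal toy under stated assumptions, not the paper's implementation: the gate here simply uses the language of the CTC argmax token, and the names `knn_distribution`, `gated_knn_ctc_frame`, `token_lang`, and the interpolation weight `lam` are all hypothetical.

```python
import numpy as np


def knn_distribution(query, keys, values, vocab_size, k=2, temperature=1.0):
    """Softmax over negative squared L2 distances of the k nearest keys."""
    dists = ((keys - query) ** 2).sum(axis=1)
    nearest = np.argsort(dists)[:k]
    weights = np.exp(-(dists[nearest] - dists[nearest].min()) / temperature)
    weights /= weights.sum()
    p = np.zeros(vocab_size)
    np.add.at(p, values[nearest], weights)  # accumulate mass per token id
    return p


def gated_knn_ctc_frame(query, p_ctc, datastores, token_lang, lam=0.5, blank_id=0):
    """Interpolate the CTC posterior with kNN evidence from ONE datastore."""
    top = int(p_ctc.argmax())
    if top == blank_id:
        return p_ctc  # assumed: no retrieval on blank frames
    keys, values = datastores[token_lang[top]]  # hypothetical gate
    p_knn = knn_distribution(query, keys, values, len(p_ctc))
    return lam * p_knn + (1.0 - lam) * p_ctc


# Toy vocabulary: 0 = blank, 1-2 = Mandarin tokens, 3-4 = English tokens.
token_lang = {1: "zh", 2: "zh", 3: "en", 4: "en"}
datastores = {
    "zh": (np.array([[0.0, 0.0], [0.0, 1.0]]), np.array([1, 2])),
    "en": (np.array([[5.0, 5.0], [5.0, 6.0]]), np.array([3, 4])),
}
query = np.array([0.0, 0.1])                   # frame embedding
p_ctc = np.array([0.1, 0.5, 0.2, 0.1, 0.1])    # CTC posterior for the frame

mixed = gated_knn_ctc_frame(query, p_ctc, datastores, token_lang)
```

Because the gate picks the Mandarin datastore for this frame, the English tokens receive no retrieval mass at all, which is the sense in which a gated pair of monolingual datastores avoids noise from the other language.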
Problem

Research questions and friction points this paper is trying to address.

Code-Switching
Speech Recognition
Noise Reduction
Innovation

Methods, ideas, or system contributions that make the work stand out.

kNN-CTC System
Code-Switching Speech Recognition
Gated Datastore Selection
Jiaming Zhou
TMCC, College of Computer Science, Nankai University, Tianjin, China
Shiwan Zhao
Independent Researcher, Research Scientist of IBM Research - China (2000-2020)
AGI, Large Language Model, NLP, Speech, Recommender System
Hui Wang
TMCC, College of Computer Science, Nankai University, Tianjin, China
Tian-Hao Zhang
PhD, University of Science & Technology Beijing
Speech LLM, ASR, TTS
Haoqin Sun
Nankai University
Affective computing, Speech signal processing, Audio understanding
Xuechen Wang
TMCC, College of Computer Science, Nankai University, Tianjin, China
Yong Qin
TMCC, College of Computer Science, Nankai University, Tianjin, China