🤖 AI Summary
To address strong cross-lingual interference and the difficulty of modeling code-switching in zero-shot Mandarin–English mixed-language ASR, this paper proposes a kNN-CTC framework with dual monolingual datastores and a gated datastore selection mechanism. The method routes each frame to the matching language-specific datastore, enabling language-aware contextual enhancement; it further employs monolingual embedding indexing with CTC alignment constraints, which removes the reliance on mixed-language data. In contrast to conventional bilingual-aligned datastore designs, the gating mechanism decouples the language representations and thereby mitigates cross-lingual interference. Experiments show that the approach substantially outperforms baselines on zero-shot mixed-language ASR without requiring any Mandarin–English mixed-language training data, establishing a new paradigm for low-resource bilingual speech recognition.
📝 Abstract
The kNN-CTC model has proven effective for monolingual automatic speech recognition (ASR). However, applying it directly to multilingual scenarios such as code-switching presents challenges. Although there is potential for performance improvement, a kNN-CTC model that uses a single bilingual datastore can inadvertently introduce noise from the other language. To address this, we propose a novel kNN-CTC-based code-switching ASR (CS-ASR) framework that employs dual monolingual datastores and a gated datastore selection mechanism to reduce noise interference. Our method selects the appropriate datastore for decoding each frame, ensuring that language-specific information is injected into the ASR process. We apply this framework to cutting-edge CTC-based models to develop an advanced CS-ASR system. Extensive experiments demonstrate the effectiveness of our gated datastore mechanism in enhancing the performance of zero-shot Chinese-English CS-ASR.
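The per-frame datastore selection described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how the gate decides, so the gate here is an assumed heuristic that compares the CTC posterior mass on Mandarin versus English tokens. The function names (`knn_distribution`, `gated_knn_ctc`), the interpolation weight `lam`, the skipping of blank-dominant frames, and the toy data are all hypothetical choices made for illustration.

```python
import numpy as np

def knn_distribution(query, keys, values, vocab_size, k=4, temperature=1.0):
    """Convert the k nearest datastore entries into a distribution over tokens."""
    dists = np.sum((keys - query) ** 2, axis=1)   # squared L2 distance to every key
    nearest = np.argsort(dists)[: min(k, len(keys))]
    logits = -dists[nearest] / temperature
    weights = np.exp(logits - logits.max())       # numerically stable softmax weights
    probs = np.zeros(vocab_size)
    np.add.at(probs, values[nearest], weights)    # accumulate weight per token label
    return probs / probs.sum()

def gated_knn_ctc(frames, ctc_probs, stores, token_lang, blank_id=0, lam=0.5, k=4):
    """Interpolate CTC posteriors with kNN posteriors from one monolingual store per frame.

    frames:     (T, D) encoder outputs used as retrieval queries
    ctc_probs:  (T, V) CTC posteriors
    stores:     {lang: (keys (N, D), values (N,))} monolingual datastores
    token_lang: (V,) language tag of each vocabulary entry
    """
    vocab_size = ctc_probs.shape[1]
    out = ctc_probs.copy()
    for t, (h, p) in enumerate(zip(frames, ctc_probs)):
        if np.argmax(p) == blank_id:
            continue  # assumption: blank-dominant frames are left untouched
        # Assumed gate: pick the language holding the most CTC posterior mass.
        mass = {lang: p[token_lang == lang].sum() for lang in stores}
        lang = max(mass, key=mass.get)
        keys, values = stores[lang]               # only this language's store is queried
        p_knn = knn_distribution(h, keys, values, vocab_size, k)
        out[t] = lam * p_knn + (1.0 - lam) * p
    return out

# Toy example: token 0 is blank, tokens 1-2 are Mandarin, tokens 3-4 are English.
token_lang = np.array(["blank", "zh", "zh", "en", "en"])
stores = {
    "zh": (np.array([[0.0, 0.0], [1.0, 0.0]]), np.array([1, 2])),
    "en": (np.array([[5.0, 5.0], [6.0, 5.0]]), np.array([3, 4])),
}
frames = np.array([[0.1, 0.0], [3.0, 3.0], [5.9, 5.0]])
ctc_probs = np.array([
    [0.10, 0.500, 0.200, 0.100, 0.100],
    [0.90, 0.025, 0.025, 0.025, 0.025],
    [0.10, 0.100, 0.100, 0.300, 0.400],
])
out = gated_knn_ctc(frames, ctc_probs, stores, token_lang, k=1)
print(out.argmax(axis=1))
```

Because each frame queries only one monolingual datastore, neighbors from the other language can never add probability mass to that frame, which is the noise-reduction effect the abstract attributes to gating. A real system would likely replace the brute-force distance computation with an approximate nearest-neighbor index.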