AI Summary
This work addresses the challenges in Chinese contextual speech recognition, where large-scale keyword lexicons often introduce irrelevant candidates and homophonic or phonetically similar errors from the base ASR model severely degrade semantic integrity, rendering conventional semantic retrieval ineffective. To mitigate these issues, the paper proposes a dynamic lexicon filtering framework that jointly leverages semantic, phonetic (pinyin), and orthographic (character shape) features, the first approach to integrate all three modalities to combat homophone interference. Furthermore, it introduces a sequence-level similarity scoring method based on an extended Smith-Waterman algorithm to enable precise alignment and reranking between N-best ASR hypotheses and target keywords. Experimental results on the Aishell-1 and RWCS-NER datasets demonstrate substantial improvements over single-feature baselines, significantly enhancing keyword recognition accuracy in downstream ASR systems.
Abstract
Contextual Automatic Speech Recognition (ASR) faces challenges with large-scale keyword dictionaries, as excessive irrelevant candidates introduce noise that degrades accuracy. To address this, dynamic filtering typically uses a base ASR model to generate preliminary hypotheses, followed by semantic text retrievers to fetch a concise subset of relevant keywords. However, this approach frequently fails in Chinese ASR. Base models often produce homophonic or near-homophonic errors that preserve the phonetic cues of the target keywords but severely distort their semantic meaning, rendering standard semantic retrievers ineffective. To resolve this, we propose a filtering framework that jointly integrates Semantic, Pinyin, and Glyph features (JSPG). Pinyin effectively retrieves targets based on phonetic similarity, while glyph provides complementary structural cues to filter out numerous irrelevant homophones inherent in Chinese. To bridge the gap between character-level pinyin/glyph metrics and sequence-level filtering, we introduce an extended Smith-Waterman algorithm that computes similarity scores between the N-best hypothesis sequences and keywords. Experiments on the Aishell-1 and RWCS-NER datasets demonstrate that JSPG significantly outperforms single-feature baselines. Furthermore, downstream contextual ASR models guided by JSPG achieve substantial improvements in keyword recognition accuracy.
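To make the sequence-level scoring idea concrete, the following is a minimal sketch of Smith-Waterman local alignment driven by a soft character-level similarity instead of exact matching. It is not the paper's implementation: the abstract does not specify the extension, so the scoring scheme (`similarity - threshold` as the substitution score, a fixed gap penalty, normalization by keyword length) is an assumption. The `PINYIN` table and `toy_sim` function are hypothetical stand-ins for real pinyin and glyph metrics, and `keyword_score` is an illustrative name.

```python
from typing import Callable, Sequence

# Hypothetical toy lookup; a real system would use a full pinyin lexicon.
PINYIN = {"张": "zhang", "章": "zhang", "伟": "wei", "威": "wei",
          "李": "li", "娜": "na"}


def toy_sim(a: str, b: str) -> float:
    """Assumed character similarity in [0, 1]: identical > homophone > unrelated."""
    if a == b:
        return 1.0
    pa = PINYIN.get(a)
    if pa is not None and pa == PINYIN.get(b):
        return 0.8  # same toneless pinyin, different glyph
    return 0.0


def smith_waterman(hyp: Sequence[str], kw: Sequence[str],
                   sim: Callable[[str, str], float],
                   gap: float = 0.5, threshold: float = 0.5) -> float:
    """Best local-alignment score between a hypothesis and a keyword.

    The substitution score is sim(a, b) - threshold, so character pairs
    less similar than the threshold are penalized.
    """
    prev = [0.0] * (len(kw) + 1)
    best = 0.0
    for i in range(1, len(hyp) + 1):
        cur = [0.0] * (len(kw) + 1)
        for j in range(1, len(kw) + 1):
            diag = prev[j - 1] + sim(hyp[i - 1], kw[j - 1]) - threshold
            # Local alignment: scores never drop below zero.
            cur[j] = max(0.0, diag, prev[j] - gap, cur[j - 1] - gap)
            best = max(best, cur[j])
        prev = cur
    return best


def keyword_score(nbest: Sequence[str], keyword: str,
                  sim: Callable[[str, str], float] = toy_sim,
                  threshold: float = 0.5) -> float:
    """Maximum normalized alignment score of a keyword over N-best hypotheses."""
    if not keyword:
        return 0.0
    max_possible = len(keyword) * (1.0 - threshold)  # every character matches exactly
    return max((smith_waterman(h, keyword, sim, threshold=threshold) / max_possible
                for h in nbest), default=0.0)


if __name__ == "__main__":
    nbest = ["我和章威去开会", "我和张威去开会"]
    for kw in ["张伟", "李娜"]:
        print(kw, round(keyword_score(nbest, kw), 3))
```

In this sketch, a keyword whose characters appear in a hypothesis only as homophones still receives a positive score, while an unrelated keyword scores zero, so a filter could keep the top-scoring keywords. Taking the maximum over the N-best list is one simple aggregation choice; averaging or weighting by hypothesis rank would be equally plausible under the abstract's description.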