PARCO: Phoneme-Augmented Robust Contextual ASR via Contrastive Entity Disambiguation

📅 2025-09-04

🤖 AI Summary

ASR systems face two key challenges when recognizing domain-specific named entities: homophone ambiguity and insufficient contextual modeling. Existing approaches neglect fine-grained phonemic distinctions and treat entities as independent tokens, which leads to incomplete multi-token biasing. To address these issues, the paper proposes PARCO (Phoneme-Augmented Robust Contextual ASR via Contrastive Entity Disambiguation), which integrates phoneme-aware entity encoding, contrastive entity disambiguation, entity-level supervision, and hierarchical entity filtering. Together, these components enable fine-grained phonetic discrimination, complete multi-token entity retrieval, and fewer false positives under uncertainty. With 1,000 distractors, PARCO achieves a 4.22% character error rate (CER) on the Chinese AISHELL-1 dataset and an 11.14% word error rate (WER) on the English DATA2 dataset. It also shows robust gains on out-of-domain datasets such as THCHS-30 and LibriSpeech.

📝 Abstract

Automatic speech recognition (ASR) systems struggle with domain-specific named entities, especially homophones. Contextual ASR improves recognition but often fails to capture fine-grained phoneme variations due to limited entity diversity. Moreover, prior methods treat entities as independent tokens, leading to incomplete multi-token biasing. To address these issues, we propose Phoneme-Augmented Robust Contextual ASR via COntrastive entity disambiguation (PARCO), which integrates phoneme-aware encoding, contrastive entity disambiguation, entity-level supervision, and hierarchical entity filtering. These components enhance phonetic discrimination, ensure complete entity retrieval, and reduce false positives under uncertainty. Experiments show that PARCO achieves CER of 4.22% on Chinese AISHELL-1 and WER of 11.14% on English DATA2 under 1,000 distractors, significantly outperforming baselines. PARCO also demonstrates robust gains on out-of-domain datasets like THCHS-30 and LibriSpeech.
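The CER and WER figures quoted above are edit-distance metrics: the number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length. The following is a minimal sketch of that standard definition, not the paper's evaluation script; the function names and the example sentences are hypothetical, and real scoring pipelines usually add text normalization that is omitted here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def cer(ref, hyp):
    """Character error rate: tokens are individual characters."""
    return error_rate(list(ref), list(hyp))


def wer(ref, hyp):
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(ref.split(), hyp.split())


# Hypothetical homophone errors on a named entity:
# one wrong character out of four, one wrong word out of four.
print(cer("联系张伟", "联系章伟"))                            # 0.25
print(wer("call doctor smith now", "call doctor smyth now"))  # 0.25
```

A single homophone substitution inside an entity costs only one token in these metrics, which is why entity-focused work often reports entity-level measures alongside overall CER or WER.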

Problem

Research questions and friction points this paper is trying to address.

Improving recognition of domain-specific named entities

Addressing homophone challenges in contextual ASR

Enhancing multi-token entity biasing with phoneme awareness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Phoneme-aware encoding enhances phonetic discrimination

Contrastive disambiguation and entity-level supervision distinguish homophones and ensure complete multi-token entity retrieval

Hierarchical filtering reduces false positive rates
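The summary does not state which contrastive objective PARCO uses. As an illustration of the general idea only, the sketch below uses an InfoNCE-style loss, a common choice that is an assumption here: an audio-side anchor embedding is pulled toward the correct entity embedding and pushed away from confusable (for example, homophonous) entities. All names, vectors, and the temperature value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("zero-norm vector")
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: -log softmax score of the positive among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Hypothetical 2-D embeddings: a distractor far from the anchor is "easy",
# a near-duplicate (homophone-like) distractor is "hard".
anchor = [1.0, 0.0]
positive = [1.0, 0.0]
easy_loss = info_nce(anchor, positive, [[0.0, 1.0]])
hard_loss = info_nce(anchor, positive, [[1.0, 0.1]])
print(easy_loss < hard_loss)  # True
```

The hard distractor yields a much larger loss than the easy one, which reflects why phonetically close negatives carry most of the training signal in this kind of objective.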
