🤖 AI Summary
This work addresses a bottleneck in end-to-end automatic speech recognition (E2E-ASR): contextual biasing typically relies on manually designed pronunciation lexicons. The authors propose a data-driven method, automatic text pronunciation correlation (ATPC), that removes the need for predefined phonetic resources. The approach requires only the supervision already used to train E2E-ASR systems, i.e., speech paired with text annotations. An iteratively-trained timestamp estimator (ITSE) first aligns the speech with its annotated text symbols; a speech encoder then converts the aligned speech into embeddings, and pronunciation correlations are derived by comparing embedding distances between text symbols. Experiments on Mandarin show that ATPC improves E2E-ASR contextual biasing, and the method holds promise for dialects and languages that lack manually designed pronunciation lexicons, offering a scalable, resource-agnostic approach to pronunciation modeling in low-resource ASR.
📝 Abstract
Effectively distinguishing the pronunciation correlations between different written texts is a significant issue in linguistic acoustics. Traditionally, such pronunciation correlations are obtained through manually designed pronunciation lexicons. In this paper, we propose a data-driven method, called automatic text pronunciation correlation (ATPC), to acquire these correlations automatically. The supervision this method requires is the same as that needed to train end-to-end automatic speech recognition (E2E-ASR) systems, i.e., speech and the corresponding text annotations. First, the iteratively-trained timestamp estimator (ITSE) algorithm is employed to align the speech with its annotated text symbols. Then, a speech encoder converts the speech into speech embeddings. Finally, we compare the speech embedding distances of different text symbols to obtain ATPC. Experimental results on Mandarin show that ATPC enhances E2E-ASR performance in contextual biasing and holds promise for dialects or languages lacking artificial pronunciation lexicons.
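The final step above, comparing embedding distances between text symbols, can be illustrated with a minimal sketch. Everything here is hypothetical: the symbol names, the toy embedding vectors (which in the real pipeline would come from a speech encoder applied to ITSE-aligned segments), and the choice of cosine distance as the similarity metric are all illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical aligned data: for each text symbol, the embeddings of the
# speech segments aligned to it (alignment would come from the ITSE step;
# the vectors below are toy data, not encoder outputs).
rng = np.random.default_rng(0)
aligned_embeddings = {
    "ma1":  [rng.normal(0.0, 0.1, 8) + 1.0 for _ in range(5)],  # similar to ma3
    "ma3":  [rng.normal(0.0, 0.1, 8) + 1.0 for _ in range(5)],
    "shu4": [rng.normal(0.0, 0.1, 8) - 1.0 for _ in range(5)],  # dissimilar
}

def symbol_centroid(vectors):
    """Average the embeddings of all segments aligned to one symbol."""
    return np.mean(vectors, axis=0)

def cosine_distance(a, b):
    """1 - cosine similarity; smaller means more similar pronunciation."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

centroids = {s: symbol_centroid(v) for s, v in aligned_embeddings.items()}

# Pairwise distances between symbols yield the pronunciation correlations.
symbols = list(centroids)
for i in range(len(symbols)):
    for j in range(i + 1, len(symbols)):
        a, b = symbols[i], symbols[j]
        print(f"{a} vs {b}: {cosine_distance(centroids[a], centroids[b]):.3f}")
```

In this toy setup, symbols whose aligned segments cluster together ("ma1" and "ma3") end up with a small pairwise distance, while unrelated symbols do not; a thresholded or ranked version of this distance table would then serve as the pronunciation-correlation resource for contextual biasing.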