🤖 AI Summary
This work addresses personalized speech recognition in multi-talker scenarios with overlapping speech. The proposed CALM framework is the first to unify target-speaker conditioning and dynamic contextual lexical biasing within a single end-to-end system. CALM jointly models acoustic and linguistic context, using speaker embeddings to guide target-speaker extraction while dynamically integrating personalized language priors. The approach yields substantial gains: on LibriSpeech2Mix, B-WER improves from 12.7 to 4.7, and on CSJMix2, B-CER drops from 16.6 to 8.4. Experiments on the AMI corpus further demonstrate strong generalization, confirming the method's robustness across diverse conversational settings.
📝 Abstract
We present CALM, a joint Contextual Acoustic-Linguistic Modeling framework for multi-speaker automatic speech recognition (ASR). In personalized AI scenarios, the joint availability of acoustic and linguistic cues naturally motivates integrating target-speaker conditioning with contextual biasing for overlapping conversations. CALM realizes this integration in an end-to-end framework through speaker-embedding-driven target-speaker extraction and dynamic-vocabulary contextual biasing. We evaluate CALM on simulated English mixtures (LibriSpeechMix) and Japanese mixtures (CSJMix, built from the Corpus of Spontaneous Japanese). On two-speaker mixtures, CALM reduces the biased word error rate (B-WER) from 12.7 to 4.7 on LibriSpeech2Mix and the biased character error rate (B-CER) from 16.6 to 8.4 on CSJMix2 (eval3), demonstrating the effectiveness of joint acoustic-linguistic modeling across languages. We additionally report results on the AMI corpus (IHM-mix condition) to validate performance on standardized speech mixtures.
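To make the joint conditioning concrete, below is a minimal PyTorch sketch of the two ingredients the abstract names: an acoustic encoder conditioned on an enrollment speaker embedding (target-speaker extraction) and attention over a dynamic list of bias-phrase embeddings (contextual biasing). All module names, dimensions, and the additive/concatenation fusion schemes are illustrative assumptions, not CALM's actual architecture.

```python
# Illustrative sketch only: the conditioning and fusion choices here are
# common patterns in the literature, assumed for exposition; they are not
# taken from the CALM paper.
import torch
import torch.nn as nn


class TargetSpeakerEncoder(nn.Module):
    """Acoustic encoder conditioned on a target-speaker embedding.

    The speaker embedding is projected and added to every frame, a common
    conditioning scheme for target-speaker extraction (an assumption here).
    """

    def __init__(self, feat_dim=80, hidden_dim=256, spk_dim=192):
        super().__init__()
        self.frame_proj = nn.Linear(feat_dim, hidden_dim)
        self.spk_proj = nn.Linear(spk_dim, hidden_dim)
        self.rnn = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, feats, spk_emb):
        # feats: (B, T, feat_dim) mixture features; spk_emb: (B, spk_dim)
        h = self.frame_proj(feats) + self.spk_proj(spk_emb).unsqueeze(1)
        out, _ = self.rnn(h)
        return out  # (B, T, hidden_dim), biased toward the target speaker


class ContextualBiasing(nn.Module):
    """Attention over embeddings of a dynamic list of bias phrases.

    Each state attends to the phrase list; the attended context is fused
    back into the state so that listed words become more likely to decode.
    """

    def __init__(self, hidden_dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=4,
                                          batch_first=True)
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, states, phrase_embs):
        # states: (B, U, hidden_dim); phrase_embs: (B, P, hidden_dim)
        ctx, _ = self.attn(states, phrase_embs, phrase_embs)
        return self.fuse(torch.cat([states, ctx], dim=-1))


if __name__ == "__main__":
    B, T, P = 2, 100, 5
    enc, bias = TargetSpeakerEncoder(), ContextualBiasing()
    feats = torch.randn(B, T, 80)      # log-mel features of the mixture
    spk_emb = torch.randn(B, 192)      # enrollment speaker embedding
    phrases = torch.randn(B, P, 256)   # embeddings of the bias-phrase list
    out = bias(enc(feats, spk_emb), phrases)
    print(out.shape)  # (2, 100, 256): acoustically and lexically conditioned
```

A full system would feed these conditioned representations into an end-to-end ASR decoder; the sketch only shows how acoustic (speaker-embedding) and linguistic (bias-phrase) context can enter the same network, which is the integration the paper proposes.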