🤖 AI Summary
This work addresses personalized speech recognition in multi-talker scenarios with overlapping speech. The proposed CALM framework is the first to unify target-speaker conditioning and dynamic contextual lexical biasing within a single end-to-end system. CALM jointly models acoustic and linguistic context, using speaker embeddings to guide target-speaker extraction while dynamically integrating personalized language priors. The approach yields substantial gains: on LibriSpeech2Mix, B-WER improves from 12.7 to 4.7, and on CSJMix2, B-CER drops from 16.6 to 8.4. Experiments on the AMI corpus further demonstrate strong generalization, confirming the method's robustness across diverse conversational settings.
📝 Abstract
We present CALM, a joint Contextual Acoustic-Linguistic Modeling framework for multi-speaker automatic speech recognition (ASR). In personalized AI scenarios, the joint availability of acoustic and linguistic cues naturally motivates integrating target-speaker conditioning with contextual biasing for overlapping conversations. CALM realizes this integration in an end-to-end framework through speaker-embedding-driven target-speaker extraction and dynamic-vocabulary contextual biasing. We evaluate CALM on simulated English mixtures (LibriSpeechMix) and Japanese mixtures (CSJMix, built from the Corpus of Spontaneous Japanese). On two-speaker mixtures, CALM reduces the biased word error rate (B-WER) from 12.7 to 4.7 on LibriSpeech2Mix and the biased character error rate (B-CER) from 16.6 to 8.4 on CSJMix2 (eval3), demonstrating the effectiveness of joint acoustic-linguistic modeling across languages. We additionally report results on the AMI corpus (IHM-mix condition) to validate performance on standardized speech mixtures.
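To make the joint conditioning concrete, below is a minimal PyTorch sketch of the two ingredients the abstract names: an acoustic encoder conditioned on an enrollment speaker embedding (target-speaker extraction) and attention over a dynamic list of bias-phrase embeddings (contextual biasing). All module names, dimensions, and the additive/concatenation fusion schemes are illustrative assumptions, not CALM's actual architecture.

```python
# Illustrative sketch only: the conditioning and fusion choices here are
# common patterns in the literature, assumed for exposition; they are not
# taken from the CALM paper.
import torch
import torch.nn as nn


class TargetSpeakerEncoder(nn.Module):
    """Acoustic encoder conditioned on a target-speaker embedding.

    The speaker embedding is projected and added to every frame, a common
    conditioning scheme for target-speaker extraction (an assumption here).
    """

    def __init__(self, feat_dim=80, hidden_dim=256, spk_dim=192):
        super().__init__()
        self.frame_proj = nn.Linear(feat_dim, hidden_dim)
        self.spk_proj = nn.Linear(spk_dim, hidden_dim)
        self.rnn = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, feats, spk_emb):
        # feats: (B, T, feat_dim) mixture features; spk_emb: (B, spk_dim)
        h = self.frame_proj(feats) + self.spk_proj(spk_emb).unsqueeze(1)
        out, _ = self.rnn(h)
        return out  # (B, T, hidden_dim), biased toward the target speaker


class ContextualBiasing(nn.Module):
    """Attention over embeddings of a dynamic list of bias phrases.

    Each state attends to the phrase list; the attended context is fused
    back into the state so that listed words become more likely to decode.
    """

    def __init__(self, hidden_dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=4,
                                          batch_first=True)
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, states, phrase_embs):
        # states: (B, U, hidden_dim); phrase_embs: (B, P, hidden_dim)
        ctx, _ = self.attn(states, phrase_embs, phrase_embs)
        return self.fuse(torch.cat([states, ctx], dim=-1))


if __name__ == "__main__":
    B, T, P = 2, 100, 5
    enc, bias = TargetSpeakerEncoder(), ContextualBiasing()
    feats = torch.randn(B, T, 80)      # log-mel features of the mixture
    spk_emb = torch.randn(B, 192)      # enrollment speaker embedding
    phrases = torch.randn(B, P, 256)   # embeddings of the bias-phrase list
    out = bias(enc(feats, spk_emb), phrases)
    print(out.shape)  # (2, 100, 256): acoustically and lexically conditioned
```

A full system would feed these conditioned representations into an end-to-end ASR decoder; the sketch only shows how acoustic (speaker-embedding) and linguistic (bias-phrase) context can enter the same network, which is the integration the paper proposes.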