Text adaptation for speaker verification with speaker-text factorized embeddings

📅 2025-08-06
🤖 AI Summary
Text-dependent speaker verification (SV) degrades when the training or enrollment utterances and the test utterances differ in text. To address this, we propose a text-adaptive framework comprising: (i) a speaker-text disentanglement network that decomposes speech representations into orthogonal speaker and text embeddings; (ii) unsupervised adaptation from text-independent to text-customized speaker embeddings using only a small amount of target-text speech data without speaker labels; and (iii) post-hoc calibration of speaker embeddings via fusion with text embeddings. Experiments on RSR2015 demonstrate substantial improvements in verification accuracy under text-mismatched conditions. Notably, our method achieves text-aware embedding adaptation without requiring any target-speaker utterances, a first in the literature. This establishes a novel paradigm for low-resource, highly generalizable text-dependent SV.

📝 Abstract
Text mismatch between pre-collected data (either training or enrollment data) and the actual test data can significantly degrade the performance of text-dependent speaker verification (SV) systems. Although this problem can be solved by carefully collecting data with the target speech content, such data collection can be costly and inflexible. In this paper, we propose a novel text adaptation framework to address the text mismatch issue. A speaker-text factorization network factorizes the input speech into speaker embeddings and text embeddings and then integrates them into a single representation at a later stage. Given a small number of speaker-independent adaptation utterances, text embeddings of the target speech content can be extracted and used to adapt the text-independent speaker embeddings into text-customized speaker embeddings. Experiments on RSR2015 show that text adaptation significantly improves performance under text-mismatch conditions.
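
The adaptation step described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how text embeddings are pooled across adaptation utterances or how speaker and text embeddings are integrated, so the mean pooling, the concatenation-based fusion, the toy vectors, and the names `mean_embedding`, `l2_normalize`, and `text_customize` are all hypothetical.

```python
import math


def mean_embedding(embeddings):
    """Average equal-length vectors (mean pooling is an assumed choice)."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def text_customize(speaker_emb, target_text_emb):
    """Fuse the two embeddings by concatenation plus length normalization.

    Assumption: the paper's integration stage is a learned network;
    concatenation is used here only as a stand-in.
    """
    return l2_normalize(list(speaker_emb) + list(target_text_emb))


# Hypothetical text embeddings extracted from a few speaker-independent
# adaptation utterances of the target phrase.
adaptation_text_embs = [[0.9, 0.1], [1.1, -0.1], [1.0, 0.0]]
target_text_emb = mean_embedding(adaptation_text_embs)

# Hypothetical text-independent speaker embedding from enrollment data.
speaker_emb = [0.6, 0.8, 0.0]
customized = text_customize(speaker_emb, target_text_emb)
```

The point of the sketch is that the target-text embedding comes from utterances by arbitrary speakers, so no recording of the enrolled speaker saying the target phrase is needed.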

Problem

Research questions and friction points this paper is trying to address.

Text mismatch harms speaker verification performance
Costly data collection for target speech content
Adapt speaker embeddings using text embeddings

Innovation

Methods, ideas, or system contributions that make the work stand out.

Speaker-text factorization network for embeddings
Text adaptation using speaker-independent utterances
Text-customized speaker embeddings improve verification
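
For orientation, a verification trial compares an enrollment embedding against a test embedding and accepts the claim when the score is high enough. The sketch below uses cosine scoring with a fixed threshold. Both are assumptions for illustration, since this page does not state the paper's scoring back-end, and the function names, threshold value, and vectors are hypothetical.

```python
import math


def cosine_score(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def verify(enroll_emb, test_emb, threshold=0.5):
    """Accept the trial when the score reaches the threshold.

    The threshold is a placeholder; in practice it would be tuned
    on held-out trials.
    """
    return cosine_score(enroll_emb, test_emb) >= threshold


# Hypothetical embeddings: the first test vector points in the same
# direction as the enrollment vector, the second is orthogonal to it.
enroll = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]
orthogonal = [0.0, 1.0, 0.0]

accepted = verify(enroll, same_direction)
rejected = verify(enroll, orthogonal)
```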

Authors

Yexin Yang
Shanghai Jiao Tong University
Speaker Verification, Speech Processing, Deep Learning, Machine Learning
Shuai Wang
MoE Key Lab of Artificial Intelligence, SpeechLab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
Xun Gong
MoE Key Lab of Artificial Intelligence, SpeechLab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
Yanmin Qian
Professor, Shanghai Jiao Tong University
Speech and Language Processing, Signal Processing, Machine Learning
Kai Yu
MoE Key Lab of Artificial Intelligence, SpeechLab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China