LASPA: Language Agnostic Speaker Disentanglement with Prefix-Tuned Cross-Attention

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
In multilingual speaker recognition, linguistic information and speaker-specific acoustic characteristics are highly entangled in speaker embeddings, severely degrading cross-lingual robustness. To address this, we propose a prefix-tuning-based cross-attention disentanglement framework whose novelty lies in integrating prefix tuning with cross-attention to explicitly extract language-invariant speaker representations. We further combine joint contrastive learning with multilingual self-supervised pretraining to enable zero-shot generalization to unseen languages. Evaluated on multiple benchmark datasets, the method achieves significant EER reductions and maintains high identification accuracy across monolingual, multilingual, and unseen-language scenarios. It improves both the generalizability and robustness of cross-lingual speaker verification without requiring language labels or target-language data during inference.
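
The summary above describes the core mechanism but the page includes no code. As a rough illustration only, the PyTorch sketch below shows one plausible reading of prefix-tuned cross-attention between a speaker stream and a linguistic stream: learnable prefix vectors are prepended to the keys and values so the attention can be steered toward language-invariant speaker cues. The class name, dimensions, and residual layout are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PrefixTunedCrossAttention(nn.Module):
    """Hypothetical sketch: the speaker stream queries the linguistic
    stream, with trainable prefixes prepended to keys/values
    (prefix tuning) to steer attention toward language-invariant cues."""

    def __init__(self, dim: int = 256, num_heads: int = 4, prefix_len: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learnable prefixes, shared across the batch (assumed design).
        self.prefix_k = nn.Parameter(torch.randn(1, prefix_len, dim) * 0.02)
        self.prefix_v = nn.Parameter(torch.randn(1, prefix_len, dim) * 0.02)
        self.norm = nn.LayerNorm(dim)

    def forward(self, speaker_feats: torch.Tensor, lang_feats: torch.Tensor) -> torch.Tensor:
        # speaker_feats: (B, T_s, dim) queries; lang_feats: (B, T_l, dim) keys/values.
        b = speaker_feats.size(0)
        k = torch.cat([self.prefix_k.expand(b, -1, -1), lang_feats], dim=1)
        v = torch.cat([self.prefix_v.expand(b, -1, -1), lang_feats], dim=1)
        out, _ = self.attn(speaker_feats, k, v)
        # Residual connection keeps the original speaker information intact.
        return self.norm(speaker_feats + out)

if __name__ == "__main__":
    block = PrefixTunedCrossAttention()
    spk = torch.randn(2, 50, 256)   # frame-level speaker features
    lng = torch.randn(2, 50, 256)   # frame-level linguistic features
    print(block(spk, lng).shape)    # torch.Size([2, 50, 256])
```

One appeal of this design is parameter efficiency: only the prefixes (and possibly the norm) need training, so a frozen backbone can be adapted to new languages cheaply.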

📝 Abstract
Speaker recognition models face challenges in multilingual settings because linguistic information becomes entangled within speaker embeddings. The overlap between vocal traits, such as accent and vocal anatomy, and a language's phonetic structure complicates the separation of linguistic and speaker information, and disentangling these components can significantly improve speaker recognition accuracy. To this end, we propose a novel disentanglement learning strategy that integrates joint learning through prefix-tuned cross-attention. The approach is particularly effective when speakers switch between languages. Experimental results show that the model generalizes across monolingual and multilingual settings, including unseen languages. Notably, the proposed model improves the equal error rate (EER) across multiple datasets, highlighting its ability to separate language information from speaker embeddings and to enhance recognition under diverse linguistic conditions.
Problem

Research questions and friction points this paper is trying to address.

Disentangling linguistic and speaker information in embeddings
Improving speaker recognition in multilingual settings
Enhancing accuracy when speakers switch languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prefix-tuned cross-attention for speaker disentanglement
Joint learning strategy for multilingual settings (see the sketch after this list)
Generalizes across seen and unseen languages
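
As a companion to the bullets above, here is a minimal, hypothetical sketch of a joint training objective: speaker classification plus language classification on a separate language branch, with a supervised contrastive pull on speaker embeddings so that a speaker's embeddings stay close across languages. The branch structure, loss names, and weights `alpha`/`beta` are illustrative assumptions; the summary mentions joint contrastive learning but not this exact recipe.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive(z: torch.Tensor, labels: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Pull embeddings of the same speaker together regardless of language."""
    z = F.normalize(z, dim=-1)
    sim = z @ z.t() / tau
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float("-inf"))          # drop self-similarity
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    per_sample = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1)
    return (per_sample / pos_mask.sum(dim=1).clamp(min=1)).mean()

def joint_loss(spk_logits, spk_labels, lang_logits, lang_labels, spk_emb,
               alpha: float = 1.0, beta: float = 0.5):
    """Hypothetical joint objective: speaker CE + language CE on the
    language branch + a contrastive pull on speaker embeddings."""
    l_spk = F.cross_entropy(spk_logits, spk_labels)
    l_lang = F.cross_entropy(lang_logits, lang_labels)
    l_con = supervised_contrastive(spk_emb, spk_labels)
    return l_spk + alpha * l_lang + beta * l_con

if __name__ == "__main__":
    n, d, n_spk, n_lang = 8, 192, 4, 2
    spk_emb = torch.randn(n, d)
    spk_labels = torch.randint(0, n_spk, (n,))
    lang_labels = torch.randint(0, n_lang, (n,))
    spk_logits = torch.randn(n, n_spk)
    lang_logits = torch.randn(n, n_lang)
    print(joint_loss(spk_logits, spk_labels, lang_logits, lang_labels, spk_emb).item())
```

With batches that mix languages per speaker, the contrastive term explicitly rewards embeddings that remain close across a speaker's languages, which is the disentanglement behavior the paper targets; language labels would only be needed during training, consistent with the label-free inference claim.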