Speaker-IPL: Unsupervised Learning of Speaker Characteristics with i-Vector based Pseudo-Labels

📅 2024-09-16

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address key challenges in unsupervised speaker recognition (heavy reliance on strong self-supervised pretraining, poor cross-domain generalization, and training instability), this paper proposes an Iterative Pseudo-Labeling (IPL) framework bootstrapped from an i-vector generative model. The authors show that the classical i-vector model suffices as a lightweight initial representation extractor, enabling high-quality self-training without complex self-supervised initialization. A systematic ablation study disentangles the impacts of initial model quality, encoder architecture, data augmentation, clustering algorithms (k-means vs. spectral clustering), and cluster count on the IPL process. Additionally, contrastive learning fine-tuning is incorporated to enhance discriminability. On speaker verification, the method achieves performance that rivals state-of-the-art approaches even though it starts from a significantly weaker initial model.

📝 Abstract

Iterative self-training, or iterative pseudo-labeling (IPL) -- using an improved model from the current iteration to provide pseudo-labels for the next iteration -- has proven to be a powerful approach to enhance the quality of speaker representations. Recent applications of IPL in unsupervised speaker recognition start with representations extracted from very elaborate self-supervised methods (e.g., DINO). However, training such strong self-supervised models is not straightforward (they require hyper-parameter tuning and may not generalize to out-of-domain data) and, moreover, may not be needed at all. To this end, we show that the simple, well-studied, and established i-vector generative model is enough to bootstrap the IPL process for the unsupervised learning of speaker representations. We also systematically study the impact of other components on the IPL process, including the initial model, the encoder, augmentations, the number of clusters, and the clustering algorithm. Remarkably, we find that even with a simple and significantly weaker initial model like i-vector, IPL can still achieve speaker verification performance that rivals state-of-the-art methods.
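
The IPL loop described in the abstract can be illustrated with a small toy sketch. This is not the paper's implementation: the function names (`kmeans`, `train_encoder`, `ipl`) are hypothetical, the "encoder" is a ridge-regression linear map standing in for a neural speaker encoder, and the initial embeddings are a noisy low-dimensional slice of the features standing in for i-vectors. It only shows the control flow: cluster the current embeddings into pseudo-labels, train on those labels, re-extract embeddings, and repeat.

```python
import numpy as np

def kmeans(x, k, n_iter=20, seed=0):
    """Plain k-means; returns one cluster index (pseudo-label) per row of x."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = x[labels == c].mean(axis=0)
    return labels

def train_encoder(features, pseudo_labels, k, ridge=1e-3):
    """Assumed stand-in for encoder training: ridge regression from
    features to one-hot pseudo-labels. Returns the weight matrix."""
    onehot = np.eye(k)[pseudo_labels]
    d = features.shape[1]
    gram = features.T @ features + ridge * np.eye(d)
    return np.linalg.solve(gram, features.T @ onehot)

def ipl(features, init_embeddings, k, n_rounds=3):
    """Iterative pseudo-labeling: cluster, train, re-embed, repeat."""
    embeddings = init_embeddings
    for _ in range(n_rounds):
        labels = kmeans(embeddings, k)            # pseudo-labels from current model
        w = train_encoder(features, labels, k)    # train on pseudo-labels
        embeddings = features @ w                 # embeddings for the next round
    return embeddings, labels

# Toy data: 3 synthetic "speakers", 40 utterances each, 8-dim features.
rng = np.random.default_rng(0)
speaker_ids = np.repeat(np.arange(3), 40)
speaker_means = rng.normal(scale=5.0, size=(3, 8))
features = speaker_means[speaker_ids] + rng.normal(size=(120, 8))

# Weak initial representation (assumed stand-in for i-vectors).
init_embeddings = features[:, :2] + rng.normal(scale=2.0, size=(120, 2))

embeddings, labels = ipl(features, init_embeddings, k=3)
print(embeddings.shape, np.bincount(labels, minlength=3))
```

In this sketch the number of clusters `k`, the clustering routine, and the initial embeddings are all swappable arguments, which mirrors the components the abstract says were studied.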

Problem

Research questions and friction points this paper is trying to address.

Unsupervised Speaker Recognition

Model Adaptability

Complexity Reduction

Innovation

Methods, ideas, or system contributions that make the work stand out.

i-vector based Iterative Self-training

Speaker Recognition

Adaptability

🔎 Similar Papers

Self-Supervised Reflective Learning Through Self-Distillation and Online Clustering for Speaker Representation Learning