🤖 AI Summary
Existing two-stage iterative frameworks for learning speaker representations from unlabeled speech suffer from high computational overhead and noisy pseudo-labels. To address these issues, we propose a self-supervised reflective learning framework that eliminates multi-round iteration. The method refines pseudo-labels during training through teacher-student self-distillation and online clustering, and it adds explicit noisy-label modeling and temporally consistent pseudo-label queues so that pseudo-labels are purified and refined continuously within a single pass. This work introduces the “reflective learning” paradigm to speaker representation learning for the first time. On VoxCeleb, a single training pass of our framework outperforms a conventional five-iteration baseline. Pseudo-label quality improves consistently and the cluster count converges rapidly, demonstrating the framework's efficiency and robustness in learning from large-scale unlabeled speech data.
📝 Abstract
Speaker representation learning is crucial for voice recognition systems, with recent advances in self-supervised approaches reducing dependency on labeled data. Current two-stage iterative frameworks, while effective, suffer from significant computational overhead due to repeated rounds of clustering and training. They also struggle with noisy pseudo labels that can impair model learning. This paper introduces self-supervised reflective learning (SSRL), an improved framework that addresses these limitations by enabling continuous refinement of pseudo labels during training. Through a teacher-student architecture and online clustering mechanism, SSRL eliminates the need for iterative training rounds. To handle label noise, we incorporate noisy label modeling and pseudo label queues that maintain temporal consistency. Experiments on VoxCeleb show SSRL's superiority over current two-stage iterative approaches, surpassing the performance of a 5-round method in just a single training round. Ablation studies validate the contributions of key components like noisy label modeling and pseudo label queues. Moreover, consistent improvements in pseudo labeling and the convergence of cluster counts demonstrate SSRL's effectiveness in deciphering unlabeled data. This work marks an important advancement in efficient and accurate self-supervised speaker representation learning through the novel reflective learning paradigm.
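As a rough illustration of two ingredients named above, the sketch below shows a per-utterance pseudo label queue and a teacher update in a teacher-student setup. The abstract does not say how the queue is aggregated or how the teacher is updated, so the majority vote, the agreement ratio, the exponential-moving-average rule, and every name used here (`PseudoLabelQueue`, `ema_update`, `maxlen`, `momentum`) are assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter, defaultdict, deque


class PseudoLabelQueue:
    """Hypothetical queue: keeps the last `maxlen` cluster ids per utterance."""

    def __init__(self, maxlen=5):
        self.maxlen = maxlen
        # One bounded queue per utterance; old assignments fall off the left.
        self._queues = defaultdict(lambda: deque(maxlen=maxlen))

    def push(self, utt_id, cluster_id):
        """Record the newest cluster assignment for an utterance."""
        self._queues[utt_id].append(cluster_id)

    def label(self, utt_id):
        """Return (majority label, agreement ratio), or (None, 0.0) if empty."""
        q = self._queues.get(utt_id)
        if not q:
            return None, 0.0
        winner, count = Counter(q).most_common(1)[0]
        return winner, count / len(q)


def ema_update(teacher, student, momentum=0.99):
    """Assumed teacher update: move teacher weights toward the student's."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


if __name__ == "__main__":
    queue = PseudoLabelQueue(maxlen=3)
    for cluster in [7, 7, 2, 7]:  # assignments from successive training steps
        queue.push("utt1", cluster)
    print(queue.label("utt1"))
    print(ema_update([1.0, 2.0], [0.0, 0.0], momentum=0.9))
```

Under these assumptions, a label that keeps flipping between clusters yields a low agreement ratio, which a training loop could use to down-weight that utterance instead of trusting a single noisy assignment.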