🤖 AI Summary
This work addresses a vulnerability of existing voice anonymization systems: although they conceal speaker identity, they may still leak distinctive speaker-specific patterns, and reliable privacy evaluation mechanisms are lacking. To this end, the authors propose a dual-stream attack model that integrates spectral and self-supervised representations, trained with a three-stage progressive strategy: foundational representation learning, cross-system generalization, and lightweight fine-tuning. A novel cross-system robustness training mechanism, inspired by the commonalities between voice conversion and anonymization, enables efficient adaptation with only 10% of the target data. Experiments on the VPAC dataset demonstrate that Stage II is critical for generalization, and that adding Stage III yields significantly lower equal error rates (EERs) under low-resource fine-tuning than state-of-the-art attack methods.
📝 Abstract
Voice anonymization masks vocal traits while preserving linguistic content, yet the anonymized speech may still leak speaker-specific patterns. To assess and strengthen privacy evaluation, we propose a dual-stream attacker that fuses spectral and self-supervised learning features via parallel encoders, trained with a three-stage strategy. Stage I establishes foundational speaker-discriminative representations. Stage II leverages the shared identity-transformation characteristics of voice conversion and anonymization, exposing the model to diverse converted speech to build cross-system robustness. Stage III provides lightweight adaptation to target anonymized data. Results on the VoicePrivacy Attacker Challenge (VPAC) dataset demonstrate that Stage II is the primary driver of generalization, enabling strong attack performance on unseen anonymization datasets. With Stage III, fine-tuning on only 10% of the target anonymization dataset surpasses current state-of-the-art attackers in terms of EER.
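The dual-stream idea, two parallel encoders whose outputs are fused into a single speaker embedding scored by cosine similarity, can be sketched in miniature. This is purely illustrative: the encoder architectures, feature dimensions (`SPEC_DIM`, `SSL_DIM`, `EMB_DIM`), and the fusion by concatenation are assumptions, not the paper's actual design; real spectral and SSL front-ends are replaced here by random projections of random features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration only; the paper's real
# encoder sizes and fusion scheme are not specified in the abstract.
SPEC_DIM, SSL_DIM, EMB_DIM = 80, 768, 192

# Toy stand-ins for the two parallel encoders: mean-pool over time,
# then a linear projection into a shared embedding size.
W_spec = rng.standard_normal((SPEC_DIM, EMB_DIM)) / np.sqrt(SPEC_DIM)
W_ssl = rng.standard_normal((SSL_DIM, EMB_DIM)) / np.sqrt(SSL_DIM)

def encode(spec_feats, ssl_feats):
    """Fuse spectral and SSL streams into one speaker embedding."""
    e_spec = spec_feats.mean(axis=0) @ W_spec   # (EMB_DIM,)
    e_ssl = ssl_feats.mean(axis=0) @ W_ssl      # (EMB_DIM,)
    fused = np.concatenate([e_spec, e_ssl])     # (2 * EMB_DIM,)
    return fused / np.linalg.norm(fused)        # length-normalize

def score(emb_a, emb_b):
    """Cosine similarity as the attacker's trial score (unit vectors)."""
    return float(emb_a @ emb_b)

# Two random "utterances": (num_frames, feature_dim) per stream.
utt1 = encode(rng.standard_normal((200, SPEC_DIM)),
              rng.standard_normal((200, SSL_DIM)))
utt2 = encode(rng.standard_normal((150, SPEC_DIM)),
              rng.standard_normal((150, SSL_DIM)))
print(utt1.shape, score(utt1, utt2))
```

In an actual attacker, such trial scores over matched and mismatched speaker pairs are what the EER reported in the paper is computed from.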