🤖 AI Summary
This study identifies a critical security threat to speaker verification and voice anonymization systems: speaker identity information implicitly encoded in temporal prosodic dynamics—such as rhythm, intonation, and speaking rate variations. To address this, we propose a context-dependent phoneme-duration embedding method that leverages deep learning to model speaker-specific prosodic representations, thereby overcoming the limitations of conventional spectral features. Based on this, we design a novel attack framework that significantly improves speaker identification accuracy on both original and anonymized speech—substantially outperforming existing baselines. Experiments demonstrate that mainstream voice anonymization techniques fail to adequately suppress temporal prosodic cues, resulting in severe speaker identity leakage. To our knowledge, this is the first systematic investigation establishing prosodic dynamics as a potent side channel undermining voice privacy. Our work introduces a new evaluation paradigm and provides a reproducible benchmark attack framework for assessing and strengthening the security of voice anonymization systems.
📝 Abstract
The temporal dynamics of speech, encompassing variations in rhythm, intonation, and speaking rate, contain important and unique information about speaker identity. This paper proposes a new method for representing speaker characteristics by extracting context-dependent duration embeddings from speech temporal dynamics. We develop novel attack models using these representations and analyze the potential vulnerabilities in speaker verification and voice anonymization systems. The experimental results show that the developed attack models provide a significant improvement in speaker verification performance for both original and anonymized data in comparison with simpler representations of speech temporal dynamics reported in the literature.
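To make the idea of context-dependent duration features concrete, the sketch below groups phoneme durations by their surrounding phonetic context (a triphone-style key). This is a minimal illustration, not the paper's method: the input alignment format, the padding symbol, and the use of a simple per-context mean in place of a learned deep embedding are all assumptions made here for clarity.

```python
# Hypothetical sketch: context-dependent phoneme-duration features.
# Assumes a forced alignment is available as (phoneme, duration_sec) pairs.
# The paper's learned embedding is replaced by a simple per-context mean
# duration, purely to illustrate what "context-dependent duration" means.

from collections import defaultdict
from statistics import mean

def context_duration_features(alignment, context=1):
    """Group durations by context-dependent phoneme units (e.g. triphones)."""
    pad = ["<s>"] * context                      # boundary padding (assumed)
    phones = pad + [p for p, _ in alignment] + pad
    durs = [d for _, d in alignment]
    feats = defaultdict(list)
    for i, d in enumerate(durs):
        # key = left context + center phoneme + right context
        key = tuple(phones[i : i + 2 * context + 1])
        feats[key].append(d)
    # summarize each context by its mean duration
    return {k: mean(v) for k, v in feats.items()}

# Toy alignment for the word "hello"
alignment = [("h", 0.06), ("eh", 0.11), ("l", 0.07), ("ow", 0.18)]
feats = context_duration_features(alignment)
print(feats[("<s>", "h", "eh")])  # mean duration of 'h' in this context
```

A real attack model would feed sequences of such context-keyed durations into a neural network trained to discriminate speakers; the grouping step above is only the feature-extraction intuition.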