Targeted Speaker Poisoning Framework in Zero-Shot Text-to-Speech

📅 2026-03-08
🤖 AI Summary
This work addresses a privacy risk in zero-shot text-to-speech (TTS) voice cloning: because a model can reconstruct a voice from just a reference prompt, speaker identities are hard to remove from a trained model. The paper formalizes the task as Speech Generation Speaker Poisoning (SGSP) and evaluates two families of baselines, inference-time filtering and parameter modification, that aim to prevent generation of target speakers' voices while preserving synthesis quality for other speakers. It introduces an evaluation framework for SGSP that measures utility with Word Error Rate (WER) and privacy with Area Under the Curve (AUC) and Forget Speaker Similarity (FSSIM). Experiments across 1, 15, and 100 forgotten speakers show strong privacy protection for up to 15 speakers, but performance degrades at 100 speakers because of increased identity overlap, revealing a scalability limit in current approaches.

📝 Abstract
Zero-shot Text-to-Speech (TTS) voice cloning poses severe privacy risks, demanding the removal of specific speaker identities from trained TTS models. Conventional machine unlearning is insufficient in this context, as zero-shot TTS can dynamically reconstruct voices from just reference prompts. We formalize this task as Speech Generation Speaker Poisoning (SGSP), in which we modify trained models to prevent the generation of specific identities while preserving utility for other speakers. We evaluate inference-time filtering and parameter-modification baselines across 1, 15, and 100 forgotten speakers. Performance is assessed through the trade-off between utility (WER) and privacy, quantified using AUC and Forget Speaker Similarity (FSSIM). We achieve strong privacy for up to 15 speakers but reveal scalability limits at 100 speakers due to increased identity overlap. Our study thus introduces a novel problem and evaluation framework toward further advances in generative voice privacy.
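The abstract names WER as the utility metric and inference-time filtering as one baseline family. The sketch below illustrates both in a minimal, dependency-free form. It is not the authors' implementation: `word_error_rate` follows the standard definition (word-level edit distance divided by reference length), while `should_block`, the use of cosine similarity over speaker embeddings, and the `threshold` value of 0.7 are assumptions made purely for illustration. The embeddings are assumed to come from some speaker encoder that is not shown here.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("vectors must be non-zero")
    return dot / norm


def should_block(prompt_embedding, forgotten_embeddings, threshold=0.7) -> bool:
    """Hypothetical filter: refuse synthesis when the reference prompt's
    speaker embedding is too close to any forgotten speaker's embedding.
    The threshold is an illustrative assumption, not a value from the paper."""
    return any(
        cosine_similarity(prompt_embedding, forgotten) >= threshold
        for forgotten in forgotten_embeddings
    )
```

A filter of this kind has to compare each prompt against every forgotten speaker, so as the forget set grows, more non-target speakers are likely to fall within the threshold of some forgotten identity. That is one intuitive reading of the identity-overlap problem the abstract reports at 100 speakers.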
Problem

Research questions and friction points this paper is trying to address.

Zero-shot Text-to-Speech
Speaker Identity Removal
Voice Privacy
Machine Unlearning
Speech Generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot TTS
Speaker Poisoning
Machine Unlearning
Voice Privacy
SGSP
Authors

Thanapat Trachu
Thomas Lord Department of Computer Science, University of Southern California, USA
Thanathai Lertpetchpun
Signal Analysis and Interpretation Lab, University of Southern California, USA
Sai Praneeth Karimireddy
USC
Machine Learning, Optimization, Privacy, Federated learning, Data economy
Shrikanth Narayanan
Signal Analysis and Interpretation Lab, University of Southern California, USA