Targeted Speaker Poisoning Framework in Zero-Shot Text-to-Speech

📅 2026-03-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a privacy risk in zero-shot voice synthesis: because a model can clone a voice from a short reference prompt, a speaker identity is hard to remove from a trained model. The paper formalizes the task as "Speech Generation Speaker Poisoning" (SGSP) and evaluates two families of baselines, inference-time filtering and parameter modification, that aim to suppress generation of target speakers' voices while preserving synthesis quality for non-target speakers. The authors introduce an evaluation framework for SGSP that measures utility with Word Error Rate (WER) and privacy with Area Under the Curve (AUC) and Forget Speaker Similarity (FSSIM), so the privacy-utility trade-off can be assessed systematically. Experiments show strong privacy protection for up to 15 forgotten speakers, but performance degrades at 100 speakers because of increased identity overlap, which exposes a scalability limit in current approaches.
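To make the idea of inference-time filtering concrete, here is a minimal sketch, not the authors' implementation. It assumes each speaker is represented by a fixed-length embedding and that a request is rejected when the reference prompt's embedding is too close to any forgotten speaker. The function names, the cosine-similarity criterion, and the `threshold` value are all illustrative assumptions; the summary does not specify how the filter is built.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def is_blocked(prompt_embedding, forget_embeddings, threshold=0.7):
    """Return True if the prompt is close to any forgotten speaker.

    `threshold` is a hypothetical tuning knob, not a value from the paper.
    """
    return any(
        cosine_similarity(prompt_embedding, forgotten) >= threshold
        for forgotten in forget_embeddings
    )


# Toy 3-dimensional embeddings (hypothetical; real ones are much larger).
forget_set = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

print(is_blocked([0.9, 0.1, 0.0], forget_set))  # near a forgotten speaker
print(is_blocked([0.0, 0.0, 1.0], forget_set))  # far from all of them
```

Under this reading, a larger forget set covers more of the embedding space, so retained speakers are more likely to fall within the threshold of some forgotten speaker. That is one plausible way to picture the identity overlap reported at 100 speakers.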

📝 Abstract

Zero-shot Text-to-Speech (TTS) voice cloning poses severe privacy risks, demanding the removal of specific speaker identities from trained TTS models. Conventional machine unlearning is insufficient in this context, as zero-shot TTS can dynamically reconstruct voices from just reference prompts. We formalize this task as Speech Generation Speaker Poisoning (SGSP), in which we modify trained models to prevent the generation of specific identities while preserving utility for other speakers. We evaluate inference-time filtering and parameter-modification baselines across 1, 15, and 100 forgotten speakers. Performance is assessed through the trade-off between utility (WER) and privacy, quantified using AUC and Forget Speaker Similarity (FSSIM). We achieve strong privacy for up to 15 speakers but reveal scalability limits at 100 speakers due to increased identity overlap. Our study thus introduces a novel problem and evaluation framework toward further advances in generative voice privacy.
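The utility and privacy metrics named above can be sketched in a few lines. WER is the standard word-level edit distance divided by the reference length. The abstract does not say how AUC is computed, so the second function assumes a detector-style setup: each generated utterance gets a score (for example, similarity to a forgotten speaker), and AUC measures how well those scores separate two groups. The function names and that setup are illustrative assumptions.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def roc_auc(positive_scores, negative_scores):
    """Chance that a random positive outscores a random negative.

    Ties count as half a win. Both lists must be non-empty.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# One deleted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
# Hypothetical similarity scores for two groups of utterances.
print(roc_auc([0.9, 0.8], [0.1, 0.2]))
```

An AUC of 0.5 means the scores cannot tell the two groups apart, and 1.0 means they separate perfectly. Which of these counts as good privacy depends on how the paper defines the two groups, which the abstract does not state.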

Problem

Research questions and friction points this paper is trying to address.

Zero-shot Text-to-Speech

Speaker Identity Removal

Voice Privacy

Machine Unlearning

Speech Generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot TTS

Speaker Poisoning

Machine Unlearning