🤖 AI Summary
This work addresses the privacy risk posed by zero-shot text-to-speech models, which can be misused to synthesize speech mimicking specific individuals. To mitigate this risk, the authors propose TruS, a framework that enables "speaker forgetting" at inference time by modulating speaker-related hidden activations, without any retraining. TruS is the first training-free mechanism for speaker forgetting: it prevents the generation of target speakers' voices, whether seen during training or unseen, while preserving non-identity attributes such as prosody and emotion. The approach offers an efficient and scalable privacy-preserving safeguard for speech synthesis systems, balancing utility against speaker identity protection.
📝 Abstract
Modern zero-shot text-to-speech (TTS) models offer unprecedented expressivity but also pose serious misuse risks, as they can synthesize the voices of individuals who never consented. In this context, speaker unlearning aims to prevent the generation of specific speaker identities upon request. Existing approaches rely on retraining, which is costly and limited to speakers seen in the training set. We present TruS, a training-free speaker unlearning framework that shifts the paradigm from data deletion to inference-time control. TruS steers identity-specific hidden activations to suppress target speakers while preserving other attributes (e.g., prosody and emotion). Experimental results show that TruS effectively prevents voice generation for both seen and unseen opt-out speakers, establishing a scalable safeguard for speech synthesis. The demo and code are available at http://mmai.ewha.ac.kr/trus.
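The core idea of steering identity-specific hidden activations can be illustrated with a minimal sketch. The snippet below is a hypothetical implementation of one common steering rule (removing the component of each hidden state along an estimated speaker-identity direction); the paper's actual modulation of activations is not specified here, so the function name, the projection rule, and the `alpha` strength parameter are all illustrative assumptions.

```python
import numpy as np

def suppress_speaker_direction(hidden, speaker_dir, alpha=1.0):
    """Hypothetical steering rule: remove the component of each hidden
    activation that lies along a speaker-identity direction.

    hidden      : (frames, dim) array of hidden activations
    speaker_dir : (dim,) vector estimated to encode the target identity
    alpha       : suppression strength (1.0 = full removal)
    """
    d = speaker_dir / np.linalg.norm(speaker_dir)  # unit identity direction
    proj = hidden @ d                              # per-frame projection coefficients
    return hidden - alpha * np.outer(proj, d)      # steer activations away from the identity

# Toy usage: 4 frames of 8-dim activations, a random "identity" direction.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))
d = rng.normal(size=8)
steered = suppress_speaker_direction(h, d)
# With alpha=1.0 the steered activations are orthogonal to the identity direction.
print(np.allclose(steered @ (d / np.linalg.norm(d)), 0.0))  # True
```

Because the edit acts only along one direction, components of the hidden state orthogonal to it (which would carry non-identity attributes such as prosody in this toy picture) are left untouched, which is the intuition behind preserving expressivity while suppressing the speaker.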