Do Not Mimic My Voice: Speaker Identity Unlearning for Zero-Shot Text-to-Speech

📅 2025-07-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address speaker identity privacy leakage in zero-shot text-to-speech (ZS-TTS) models, this paper proposes Teacher-Guided Unlearning (TGU), the first machine unlearning framework tailored for ZS-TTS. TGU selectively removes a specified speaker's identity by combining knowledge distillation from a frozen teacher with controlled random perturbation, while preserving synthesis fidelity for all other speakers. A stochastic mechanism suppresses consistent reconstruction of forgotten speakers' utterances, and a dedicated metric, speaker-Zero Retrain Forgetting (spk-ZRF), quantifies unlearning efficacy. Experiments demonstrate that TGU prevents the model from replicating the target speaker's voice while maintaining cross-speaker synthesis quality comparable to state-of-the-art ZS-TTS models. This work establishes a verifiable paradigm for privacy governance in speech synthesis models.
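The retain/forget split described above can be sketched as a toy objective. This is a rough illustration, not the paper's implementation: the teacher model, prompt representation, and perturbation scheme are all assumptions. The idea is that for retained speakers the student distills from the frozen teacher on the real speaker prompt, while for forget speakers the teacher is queried with a randomly drawn prompt, so the student never learns a consistent mapping to the forgotten voice.

```python
import numpy as np

def tgu_loss(student_out, teacher, text, prompt, is_forget, rng):
    # Retain speakers: distill from the frozen teacher on the real speaker prompt.
    # Forget speakers: query the teacher with a randomly drawn prompt so the
    # student learns to disregard the forgotten speaker's identity.
    if is_forget:
        random_prompt = rng.standard_normal(prompt.shape)  # random speaker cue
        target = teacher(text, random_prompt)
    else:
        target = teacher(text, prompt)
    return float(np.mean((student_out - target) ** 2))  # L2 distillation loss

# Toy stand-ins: a "teacher" that simply mixes text and prompt features.
rng = np.random.default_rng(0)
teacher = lambda text, prompt: text + prompt
text, prompt = np.ones(4), np.full(4, 2.0)
student_out = teacher(text, prompt)

retain_loss = tgu_loss(student_out, teacher, text, prompt, False, rng)
forget_loss = tgu_loss(student_out, teacher, text, prompt, True, rng)
```

A student that already matches the teacher incurs zero loss on retained speakers, but a nonzero loss on forget speakers, since the randomized target pulls it away from faithful reconstruction.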

📝 Abstract
The rapid advancement of Zero-Shot Text-to-Speech (ZS-TTS) technology has enabled high-fidelity voice synthesis from minimal audio cues, raising significant privacy and ethical concerns. Despite these threats to voice privacy, research on selectively removing the knowledge needed to replicate unwanted individual voices from pre-trained model parameters has not been explored. In this paper, we address the new challenge of speaker identity unlearning for ZS-TTS systems. To meet this goal, we propose the first machine unlearning frameworks for ZS-TTS, in particular Teacher-Guided Unlearning (TGU), designed to ensure the model forgets designated speaker identities while retaining its ability to generate accurate speech for other speakers. Our proposed methods incorporate randomness to prevent consistent replication of forget speakers' voices, ensuring unlearned identities remain untraceable. Additionally, we propose a new evaluation metric, speaker-Zero Retrain Forgetting (spk-ZRF), which assesses the model's ability to disregard prompts associated with forgotten speakers, effectively neutralizing its knowledge of these voices. Experiments conducted on a state-of-the-art model demonstrate that TGU prevents the model from replicating forget speakers' voices while maintaining high quality for other speakers. The demo is available at https://speechunlearn.github.io/
Problem

Research questions and friction points this paper is trying to address.

Removing specific speaker identities from TTS models
Preventing unauthorized voice replication in zero-shot systems
Ensuring privacy while maintaining speech synthesis quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Teacher-Guided Unlearning framework for ZS-TTS
Incorporates randomness to prevent voice replication
Proposes spk-ZRF metric to evaluate unlearning effectiveness
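As a rough illustration of the metric family spk-ZRF belongs to, here is a minimal ZRF-style score in Python. It assumes, as in the original Zero Retrain Forgetting formulation, that the unlearned model's output distributions on forget-speaker prompts are compared against those of a randomly initialized model via Jensen-Shannon divergence; the paper's exact spk-ZRF definition may differ.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    # Jensen-Shannon divergence (base-2 log, so the value lies in [0, 1]).
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log2((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def zrf_score(unlearned_dists, random_model_dists):
    # 1.0 means the unlearned model is as uninformative on forget-speaker
    # prompts as a randomly initialized model (i.e., thorough forgetting).
    divs = [js_divergence(p, q) for p, q in zip(unlearned_dists, random_model_dists)]
    return 1.0 - float(np.mean(divs))

# Toy check: identical output distributions give the maximal score of 1.0.
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.2, 0.7])
perfect = zrf_score([p], [p.copy()])
partial = zrf_score([p], [q])
```

A score near 1.0 indicates the unlearned model's behavior on forget-speaker prompts is indistinguishable from noise, while lower scores indicate residual speaker knowledge.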