Counterfactual Activation Editing for Post-hoc Prosody and Mispronunciation Correction in TTS Models

📅 2025-06-01

📈 Citations: 0

✨ Influential: 0

career value

183K/year

🤖 AI Summary

Existing text-to-speech (TTS) models exhibit notable limitations in prosody control and pronunciation error correction: prosody modulation typically requires dedicated modules or additional training, precluding post-hoc editing; pronunciation correction heavily relies on grapheme-to-phoneme dictionaries, resulting in poor generalization under low-resource conditions. This paper proposes a model-agnostic counterfactual activation editing method—the first to introduce counterfactual editing into TTS inference—enabling direct intervention in pretrained model hidden-layer representations. Our approach achieves end-to-end prosody control and mispronunciation correction without fine-tuning, external lexicons, or auxiliary modules. It leverages gradient-guided hidden-layer activation editing and counterfactual feature perturbation. Evaluated across multilingual TTS systems, our method significantly improves accent and pause control accuracy (+23.6%) and mispronunciation correction rate (+31.4%), while preserving prosodic naturalness and synthesis quality.

Technology Category

Application Category

📝 Abstract

Recent advances in Text-to-Speech (TTS) have significantly improved speech naturalness, increasing the demand for precise prosody control and mispronunciation correction. Existing approaches for prosody manipulation often depend on specialized modules or additional training, limiting their capacity for post-hoc adjustments. Similarly, traditional mispronunciation correction relies on grapheme-to-phoneme dictionaries, making it less practical in low-resource settings. We introduce Counterfactual Activation Editing, a model-agnostic method that manipulates internal representations in a pre-trained TTS model to achieve post-hoc control of prosody and pronunciation. Experimental results show that our method effectively adjusts prosodic features and corrects mispronunciations while preserving synthesis quality. This opens the door to inference-time refinement of TTS outputs without retraining, bridging the gap between pre-trained TTS models and editable speech synthesis.

Problem

Research questions and friction points this paper is trying to address.

Post-hoc prosody control in TTS models

Mispronunciation correction without retraining

Model-agnostic editing for speech synthesis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Counterfactual Activation Editing for TTS

Post-hoc prosody and pronunciation control

Model-agnostic internal representation manipulation

🔎 Similar Papers

No similar papers found.