Counterfactual Activation Editing for Post-hoc Prosody and Mispronunciation Correction in TTS Models

📅 2025-06-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-to-speech (TTS) models are limited in both prosody control and mispronunciation correction. Prosody modulation typically requires dedicated modules or additional training, which precludes post-hoc editing; pronunciation correction relies heavily on grapheme-to-phoneme dictionaries, which generalize poorly in low-resource settings. This paper proposes Counterfactual Activation Editing, a model-agnostic method that intervenes directly in the hidden-layer representations of a pre-trained TTS model and is described as the first to bring counterfactual editing to TTS inference. The approach provides end-to-end prosody control and mispronunciation correction without fine-tuning, external lexicons, or auxiliary modules, using gradient-guided editing of hidden-layer activations together with counterfactual feature perturbation. Evaluated across multilingual TTS systems, the method is reported to improve accent and pause control accuracy (+23.6%) and mispronunciation correction rate (+31.4%) while preserving prosodic naturalness and synthesis quality.
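
To make the idea of gradient-guided activation editing concrete, here is a minimal toy sketch in plain Python. It is not the paper's implementation: the linear `probe`, the parameters `w` and `b`, the vector `h0`, and the squared-error objective are all hypothetical stand-ins chosen for illustration. The sketch only shows the general pattern of holding model weights fixed and running gradient descent on a hidden activation until a predicted attribute (for example, a pitch value) reaches a desired target.

```python
# Toy illustration only. `probe` stands in for a differentiable predictor of a
# prosodic attribute read from one hidden activation vector. Neither the probe
# nor the update rule below is taken from the paper.

def probe(h, w, b):
    """Linear predictor of a prosodic attribute from a hidden activation."""
    return sum(wi * hi for wi, hi in zip(w, h)) + b


def edit_activation(h0, w, b, target, lr=0.1, steps=200, reg=0.0):
    """Gradient descent on the activation itself; probe weights stay frozen.

    Minimises (probe(h) - target)^2 + reg * ||h - h0||^2 with respect to h.
    """
    h = list(h0)  # copy so the original activation is left untouched
    for _ in range(steps):
        err = probe(h, w, b) - target
        # Gradient of the objective w.r.t. h: 2*err*w + 2*reg*(h - h0)
        grad = [2 * err * wi + 2 * reg * (hi - h0i)
                for wi, hi, h0i in zip(w, h, h0)]
        h = [hi - lr * gi for hi, gi in zip(h, grad)]
    return h


h0 = [0.5, -1.0, 2.0]            # hypothetical hidden activation
w, b = [1.0, 0.5, -0.25], 0.1    # hypothetical frozen probe parameters
h_edit = edit_activation(h0, w, b, target=1.0)
```

With `reg=0.0` and a linear probe, every update moves along `w`, so the edited activation differs from `h0` only in the direction the probe is sensitive to. A real TTS model would replace the probe with the downstream network and would need an automatic-differentiation framework to obtain the gradient.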

📝 Abstract
Recent advances in Text-to-Speech (TTS) have significantly improved speech naturalness, increasing the demand for precise prosody control and mispronunciation correction. Existing approaches for prosody manipulation often depend on specialized modules or additional training, limiting their capacity for post-hoc adjustments. Similarly, traditional mispronunciation correction relies on grapheme-to-phoneme dictionaries, making it less practical in low-resource settings. We introduce Counterfactual Activation Editing, a model-agnostic method that manipulates internal representations in a pre-trained TTS model to achieve post-hoc control of prosody and pronunciation. Experimental results show that our method effectively adjusts prosodic features and corrects mispronunciations while preserving synthesis quality. This opens the door to inference-time refinement of TTS outputs without retraining, bridging the gap between pre-trained TTS models and editable speech synthesis.
Problem

Research questions and friction points this paper is trying to address.

Post-hoc prosody control in TTS models
Mispronunciation correction without retraining
Model-agnostic editing for speech synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Counterfactual Activation Editing for TTS
Post-hoc prosody and pronunciation control
Model-agnostic internal representation manipulation