🤖 AI Summary
This work proposes a training-free, zero-shot controllable framework that enables fine-grained intra-utterance control over emotion and duration in pre-trained text-to-speech (TTS) systems. Addressing the limitation of existing approaches, which typically support only utterance-level control, the method introduces a segment-aware emotion and duration scheduling strategy: it combines causal masking with monotonic stream alignment filtering to isolate emotion conditioning, and pairs local duration embedding steering with global EOS logit modulation to achieve precise editing of emotional expression and duration at the sub-utterance level. A large language model automatically generates segmentation prompts, supported by a purpose-built 30,000-sample multi-emotion and duration-annotated text dataset, eliminating the need for manual prompt engineering. The approach achieves state-of-the-art intra-utterance consistency in multi-emotion and duration control while maintaining the naturalness of the underlying TTS model, marking the first demonstration of fine-grained multi-emotion and duration control within utterances without model fine-tuning.
📝 Abstract
While controllable Text-to-Speech (TTS) has achieved notable progress, most existing methods remain limited to utterance-level control, making fine-grained intra-utterance expression challenging due to their reliance on non-public datasets or complex multi-stage training. In this paper, we propose a training-free controllable framework for pretrained zero-shot TTS that enables intra-utterance emotion and duration expression. Specifically, we propose a segment-aware emotion conditioning strategy that combines causal masking with monotonic stream alignment filtering to isolate emotion conditioning and schedule mask transitions, enabling smooth intra-utterance emotion shifts while preserving global semantic coherence. Building on this, we further propose a segment-aware duration steering strategy that combines local duration embedding steering with global EOS logit modulation, allowing local duration adjustment while ensuring globally consistent termination. To eliminate the need for segment-level manual prompt engineering, we construct a 30,000-sample multi-emotion and duration-annotated text dataset that enables LLM-based automatic prompt construction. Extensive experiments demonstrate that our training-free method not only achieves state-of-the-art intra-utterance consistency in multi-emotion and duration control, but also maintains the baseline speech quality of the underlying TTS model. Audio samples are available at https://aclanonymous111.github.io/TED-TTS-DemoPage/.
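The abstract describes two scheduling mechanisms: segment-aware emotion conditioning with smooth mask transitions at segment boundaries, and duration steering that pairs local adjustment with a global end-of-speech (EOS) logit bias. The paper's actual implementation is not shown here; the following is a minimal illustrative sketch under stated assumptions, with hypothetical helpers `segment_emotion_schedule` (a per-frame soft emotion mask with linear cross-fades, standing in for the scheduled mask transitions) and `modulate_eos_logit` (a simple linear bias that suppresses early termination and encourages stopping past a target length).

```python
def segment_emotion_schedule(segments, num_frames, num_emotions, fade=4):
    """Build a per-frame soft emotion mask with cross-fades at boundaries.

    segments: ordered, contiguous (start_frame, end_frame, emotion_id) triples.
    Returns a list of num_frames rows, each a list of num_emotions weights
    summing to 1. Illustrative only; not the paper's actual masking scheme.
    """
    # Hard one-hot assignment per frame.
    hard = [[0.0] * num_emotions for _ in range(num_frames)]
    for start, end, emo in segments:
        for t in range(start, min(end, num_frames)):
            hard[t][emo] = 1.0
    # Smooth with a moving average of width ~fade, then renormalize so each
    # frame's weights sum to 1 (soft transition across segment boundaries).
    half = fade // 2
    soft = []
    for t in range(num_frames):
        lo, hi = max(0, t - half), min(num_frames, t + half + 1)
        avg = [sum(hard[u][k] for u in range(lo, hi)) / (hi - lo)
               for k in range(num_emotions)]
        total = sum(avg) or 1.0
        soft.append([a / total for a in avg])
    return soft


def modulate_eos_logit(eos_logit, frames_emitted, target_frames, gain=0.1):
    """Globally bias the EOS logit: negative bias before the target length
    (suppress early stopping), positive bias after it (encourage stopping)."""
    return eos_logit + gain * (frames_emitted - target_frames)
```

For example, with two ten-frame segments labeled with different emotions, frames deep inside each segment stay one-hot while the boundary frame receives a blend of both emotion weights, and the EOS bias is negative while generation is shorter than the target duration.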