Segment-Aware Conditioning for Training-Free Intra-Utterance Emotion and Duration Control in Text-to-Speech

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a training-free framework that enables fine-grained intra-utterance control over emotion and duration in pretrained zero-shot text-to-speech (TTS) systems. Addressing the limitation of existing approaches, which typically support only utterance-level control, the method introduces a segment-aware emotion and prosody scheduling strategy. It combines causal masking, monotonic stream alignment filtering, local duration embedding steering, and global end-of-sequence (EOS) logit modulation to edit emotional expression and duration at the sub-utterance level. A large language model automatically generates segmentation prompts, supported by a 30,000-sample multi-emotion and duration-annotated text dataset, eliminating the need for manual prompt engineering. The approach achieves state-of-the-art intra-utterance consistency in multi-emotion and duration control while preserving the baseline speech quality of the underlying TTS model, marking the first demonstration of fine-grained multi-emotion and rhythm control within a sentence without model fine-tuning.
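The paper does not release implementation details on this page, but the scheduling idea it describes (per-segment emotion conditions with smooth transitions between segments) can be sketched as a blend-weight schedule over frames. The function names, the linear cross-fade, and the frame-level granularity below are illustrative assumptions, not the authors' method:

```python
import numpy as np

def segment_emotion_schedule(n_frames, boundaries, ramp=8):
    """Blend weights for per-segment emotion conditioning.

    boundaries: frame indices where the emotion label changes.
    ramp: number of frames over which one emotion cross-fades into the next.
    Returns an (n_frames, n_segments) matrix whose rows sum to 1.
    """
    edges = [0] + list(boundaries) + [n_frames]
    n_seg = len(edges) - 1
    w = np.zeros((n_frames, n_seg))
    for s in range(n_seg):
        w[edges[s]:edges[s + 1], s] = 1.0
    # Linearly cross-fade around each internal boundary so intra-utterance
    # emotion shifts are smooth rather than abrupt.
    for s, b in enumerate(boundaries):
        lo, hi = max(b - ramp // 2, 0), min(b + ramp // 2, n_frames)
        t = np.linspace(0.0, 1.0, hi - lo)
        w[lo:hi, s] = 1.0 - t
        w[lo:hi, s + 1] = t
    return w

def condition_frames(emotion_embs, weights):
    """Mix per-segment emotion embeddings into a per-frame condition vector."""
    return weights @ emotion_embs  # (n_frames, emb_dim)
```

In the actual system, the per-frame condition would feed the pretrained TTS model's conditioning pathway, with causal masking and alignment filtering keeping each segment's emotion from leaking across boundaries.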

📝 Abstract
While controllable Text-to-Speech (TTS) has achieved notable progress, most existing methods remain limited to inter-utterance-level control, making fine-grained intra-utterance expression challenging due to their reliance on non-public datasets or complex multi-stage training. In this paper, we propose a training-free controllable framework for pretrained zero-shot TTS to enable intra-utterance emotion and duration expression. Specifically, we propose a segment-aware emotion conditioning strategy that combines causal masking with monotonic stream alignment filtering to isolate emotion conditioning and schedule mask transitions, enabling smooth intra-utterance emotion shifts while preserving global semantic coherence. Based on this, we further propose a segment-aware duration steering strategy to combine local duration embedding steering with global EOS logit modulation, allowing local duration adjustment while ensuring globally consistent termination. To eliminate the need for segment-level manual prompt engineering, we construct a 30,000-sample multi-emotion and duration-annotated text dataset to enable LLM-based automatic prompt construction. Extensive experiments demonstrate that our training-free method not only achieves state-of-the-art intra-utterance consistency in multi-emotion and duration control, but also maintains baseline-level speech quality of the underlying TTS model. Audio samples are available at https://aclanonymous111.github.io/TED-TTS-DemoPage/.
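The duration-steering strategy in the abstract pairs a local edit (scaling duration embeddings per segment) with a global one (modulating the EOS logit so the utterance still terminates consistently). A minimal sketch of that two-part idea follows; the function signatures, the multiplicative scaling, and the linear EOS bias are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def steer_durations(dur_emb, seg_slices, scales):
    """Scale duration embeddings locally, one factor per segment.

    dur_emb: (n_tokens, dim) duration embeddings from the TTS model.
    seg_slices: list of (start, end) token-index pairs, one per segment.
    scales: per-segment factors (>1 lengthens the segment, <1 shortens it).
    """
    out = dur_emb.copy()
    for (start, end), s in zip(seg_slices, scales):
        out[start:end] *= s
    return out

def modulate_eos(logits, eos_id, target_len, step, sharpness=0.5):
    """Bias the end-of-sequence logit so total length tracks target_len.

    Suppresses EOS before target_len and boosts it afterwards, keeping
    global termination consistent with the local duration edits.
    """
    out = logits.copy()
    out[eos_id] += sharpness * (step - target_len)
    return out
```

Together, the local scaling changes how long each segment sounds, while the EOS bias prevents the generation loop from ending too early or running past the adjusted total length.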
Problem

Research questions and friction points this paper is trying to address.

controllable Text-to-Speech
intra-utterance control
emotion variation
duration control
training-free
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free
intra-utterance control
segment-aware conditioning
emotion modulation
duration steering
Authors
Qifan Liang
School of Computing, National University of Singapore
Yuansen Liu
School of Computing, National University of Singapore
Ruixin Wei
School of Computing, National University of Singapore
Nan Lu
University of Tübingen (Machine Learning)
Junchuan Zhao
School of Computing, National University of Singapore
Ye Wang
School of Computing, National University of Singapore (Sound and Music Computing, Sensor Computing, eHealth)