HiStyle: Hierarchical Style Embedding Predictor for Text-Prompt-Guided Controllable Speech Synthesis

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-prompt-based controllable speech synthesis methods neglect the intrinsic hierarchical distribution of global style embeddings during prediction, leading to insufficient semantic–acoustic alignment. Method: We observe that style embeddings exhibit a “timbre-first, style-second” hierarchical clustering structure; accordingly, we propose a two-stage hierarchical style modeling framework: (1) coarse-grained timbre-dominant embedding prediction, followed by (2) fine-grained style semantic refinement. We further incorporate contrastive learning to strengthen text–audio cross-modal alignment and jointly optimize prompt annotation quality via statistical analysis and human calibration. t-SNE visualization validates the hypothesized embedding distribution. Contribution/Results: Experiments demonstrate that our method significantly improves style controllability over state-of-the-art embedding prediction approaches while preserving speech naturalness and intelligibility, yielding synthesized speech that more accurately reflects the intended semantics of text prompts.
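The paper's t-SNE analysis of global style embeddings can be illustrated with a minimal sketch. The synthetic data below merely simulates the claimed "timbre-first, style-second" structure (speaker centers far apart, sub-style centers nested within them); the dimensions, cluster counts, and noise scales are assumptions, not the paper's actual setup:

```python
# Hypothetical sketch: projecting hierarchically clustered "style embeddings"
# with t-SNE. Real TTS embeddings are replaced by synthetic data that nests
# fine style clusters inside coarse timbre (speaker) clusters.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

n_per_style, dim = 20, 64
embeddings, labels = [], []
for speaker in range(3):                                   # coarse level: timbre
    speaker_center = rng.normal(0.0, 10.0, size=dim)
    for style in range(2):                                 # fine level: style
        style_center = speaker_center + rng.normal(0.0, 2.0, size=dim)
        embeddings.append(style_center + rng.normal(0.0, 0.5, size=(n_per_style, dim)))
        labels += [(speaker, style)] * n_per_style         # usable to color a scatter plot

X = np.vstack(embeddings)                                  # (120, 64)

# Project to 2-D; points should group first by speaker, then by sub-style.
proj = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(X)
print(proj.shape)  # (120, 2)
```

Plotting `proj` colored by `labels` would show the nested clustering pattern the paper reports for real systems.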

📝 Abstract
Controllable speech synthesis refers to the precise control of speaking style by manipulating specific prosodic and paralinguistic attributes, such as gender, volume, speech rate, pitch, and pitch fluctuation. With the integration of advanced generative models, particularly large language models (LLMs) and diffusion models, controllable text-to-speech (TTS) systems have increasingly transitioned from label-based control to natural language description-based control, which is typically implemented by predicting global style embeddings from textual prompts. However, this straightforward prediction overlooks the underlying distribution of the style embeddings, which may hinder the full potential of controllable TTS systems. In this study, we use t-SNE analysis to visualize and analyze the global style embedding distribution of various mainstream TTS systems, revealing a clear hierarchical clustering pattern: embeddings first cluster by timbre and subsequently subdivide into finer clusters based on style attributes. Based on this observation, we propose HiStyle, a two-stage style embedding predictor that hierarchically predicts style embeddings conditioned on textual prompts, and further incorporate contrastive learning to help align the text and audio embedding spaces. Additionally, we propose a style annotation strategy that leverages the complementary strengths of statistical methodologies and human auditory preferences to generate more accurate and perceptually consistent textual prompts for style control. Comprehensive experiments demonstrate that when applied to the base TTS model, HiStyle achieves significantly better style controllability than alternative style embedding predicting approaches while preserving high speech quality in terms of naturalness and intelligibility. Audio samples are available at https://anonymous.4open.science/w/HiStyle-2517/.
Problem

Research questions and friction points this paper is trying to address.

Predicting hierarchical style embeddings from text prompts for controllable speech synthesis
Addressing limitations of global style embedding prediction in TTS systems
Improving style controllability while maintaining speech quality and naturalness
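The two-stage "coarse timbre, then fine style" prediction above can be sketched as follows. This is a minimal numpy illustration under assumed dimensions and an assumed residual-refinement scheme; it is not the paper's actual architecture, and the random matrices stand in for trained weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d_text, d_style = 32, 16  # assumed prompt/style embedding sizes

# Stand-ins for trained weights of the two prediction stages.
W_coarse = rng.normal(size=(d_text, d_style)) * 0.1            # stage 1: timbre-dominant
W_fine = rng.normal(size=(d_text + d_style, d_style)) * 0.1    # stage 2: style refinement

def predict_style(text_emb: np.ndarray) -> np.ndarray:
    """Hierarchical prediction: a coarse timbre-dominant embedding first,
    then a fine-grained residual refinement conditioned on both the text
    prompt and the coarse result."""
    coarse = np.tanh(text_emb @ W_coarse)                        # timbre-first
    residual = np.tanh(np.concatenate([text_emb, coarse]) @ W_fine)
    return coarse + residual                                     # style-second

style = predict_style(rng.normal(size=d_text))
print(style.shape)  # (16,)
```

The key design point mirrored here is that stage 2 sees the stage-1 output, so fine style prediction happens within the coarse timbre cluster rather than over the whole embedding space.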
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical style embedding predictor for text prompts
Contrastive learning aligns text and audio embeddings
Statistical and human preference-based style annotation strategy
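The contrastive text–audio alignment in the list above is typically a symmetric InfoNCE objective over batches of paired prompt/audio embeddings. The sketch below is a generic numpy version of that loss, not the paper's exact formulation; the temperature and batch shapes are assumptions:

```python
import numpy as np

def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(text_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE: matched (diagonal) prompt/audio pairs are pulled
    together, mismatched pairs pushed apart, in both directions."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    logits = t @ a.T / temperature
    diag = np.arange(logits.shape[0])
    loss_t2a = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_a2t = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_t2a + loss_a2t)

rng = np.random.default_rng(0)
batch, dim = 8, 16
audio = rng.normal(size=(batch, dim))
matched = contrastive_loss(audio + 0.01 * rng.normal(size=(batch, dim)), audio)
shuffled = contrastive_loss(rng.normal(size=(batch, dim)), audio)
print(matched, shuffled)
```

Well-aligned pairs (`matched`) should yield a much lower loss than unrelated pairs (`shuffled`), which is what drives the text and audio embedding spaces together during training.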
Ziyu Zhang
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Hanzhao Li
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Speech Synthesis · Spontaneous Speech · Speech Codec
Jingbin Hu
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Wenhao Li
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Lei Xie
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China