HD-PPT: Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-based TTS

📅 2025-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing instruction-based TTS models suffer from a modality gap between coarse-grained text instructions and fine-grained speech tokens, which hinders precise prosodic and phonetic control. To address this, we propose HD-PPT, a framework for Hierarchical Decoding of Content- and Prompt-Preference Tokens. First, we design a novel hierarchical speech codec that disentangles content-preference and prompt-preference speech tokens. Second, we introduce a dual-preference token extraction mechanism, jointly supervised by ASR and CLAP objectives, that aligns textual instructions with the hierarchical speech representations. Third, we establish a layered decoding process that enables controllable generation at the semantic, prosodic, and phonemic levels. By combining an LLM backbone with this codec and supervision scheme, HD-PPT achieves state-of-the-art instruction adherence and speech naturalness, significantly improving the accuracy and expressiveness of controllable TTS.
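The dual supervision described above can be pictured as a weighted sum of objectives: an ASR loss pulls the content-preference tokens toward phonetic information, while a CLAP loss pulls the prompt-preference tokens toward instruction-relevant style. A minimal sketch, assuming scalar losses; the function name and the unit weights are illustrative, not from the paper:

```python
def joint_codec_loss(recon_loss, asr_loss, clap_loss,
                     w_asr=1.0, w_clap=1.0):
    """Illustrative combined training objective for the hierarchical codec.

    recon_loss -- codec reconstruction loss on the waveform/tokens
    asr_loss   -- supervises content-preference tokens (phonetic content)
    clap_loss  -- supervises prompt-preference tokens (instruction/style)
    """
    return recon_loss + w_asr * asr_loss + w_clap * clap_loss


# Example: equal weighting simply sums the three terms.
total = joint_codec_loss(1.0, 2.0, 3.0)
```

In practice each term would be computed by a dedicated head over the respective token stream; the weights would be tuned so neither auxiliary objective dominates reconstruction.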

📝 Abstract
Large Language Model (LLM)-based Text-to-Speech (TTS) models have reached a high degree of naturalness; however, precise control over TTS inference remains challenging. Although instruction-based TTS (Instruct-TTS) models have been proposed, they still lack fine-grained control because of the modality gap between single-level text instructions and multi-level speech tokens. To address this limitation, we propose HD-PPT, a framework that transforms speech synthesis into a structured, hierarchical task. To enable fine-grained control, we introduce a novel speech codec that extracts distinct prompt-preference and content-preference tokens from the complex speech tokens, supervised by automatic speech recognition (ASR) and contrastive language-audio pre-training (CLAP) objectives. To bridge the modality gap between these tokens, we propose a hierarchical decoding strategy in which the LLM generates tokens in a structured order: first semantic tokens, then fine-grained style tokens, and finally the complete acoustic representation. Extensive experiments demonstrate that this hierarchical paradigm significantly improves instruction adherence and achieves state-of-the-art naturalness, validating our approach for precise and controllable speech synthesis. Audio samples are available at https://xxh333.github.io/.
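The three-stage decoding order in the abstract (semantic, then style, then acoustic) can be sketched as successive conditioned generations. This is a hedged illustration, not the paper's implementation: `generate` stands in for the LLM's autoregressive sampling, and the exact conditioning format is an assumption.

```python
def hierarchical_decode(text, instruction, generate):
    """Illustrative three-stage decoding.

    generate(context) -> list of tokens; a stand-in for LLM sampling
    conditioned on the given context sequence.
    """
    # Stage 1: semantic (content-preference) tokens from the input text.
    semantic = generate([text])
    # Stage 2: fine-grained style (prompt-preference) tokens,
    # conditioned on the instruction and the semantic tokens.
    style = generate([instruction] + semantic)
    # Stage 3: complete acoustic representation, conditioned on everything.
    acoustic = generate([text] + semantic + style)
    return semantic, style, acoustic
```

The point of the ordering is that each coarser level constrains the next: style tokens cannot contradict the already-fixed semantic content, and acoustic tokens must realize both.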
Problem

Research questions and friction points this paper is trying to address.

Achieving fine-grained control in instruction-based text-to-speech synthesis
Bridging the modality gap between text instructions and speech tokens
Enabling precise hierarchical control over semantic and acoustic properties
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical decoding strategy for structured token generation
Novel speech codec extracting distinct preference tokens
Bridging modality gap with ASR and CLAP supervision
Authors
Sihang Nie
South China University of Technology, Guangzhou, China
Xiaofen Xing
South China University of Technology
Jingyuan Xing
South China University of Technology, Guangzhou, China
Baiji Liu
South China University of Technology, Guangzhou, China; Guangzhou Quwan Network Technology, Guangzhou, China
Xiangmin Xu
South China University of Technology