Characteristic-Specific Partial Fine-Tuning for Efficient Emotion and Speaker Adaptation in Codec Language Text-to-Speech Models

📅 2025-01-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing codec language TTS models suffer from parameter redundancy, low fine-tuning efficiency, and catastrophic forgetting in emotion expression and speaker adaptation, largely because task-relevant feature layers are not explicitly separated. To address this, the paper proposes a characteristic-specific partial fine-tuning strategy (CSP-FT). A weighted-sum contribution analysis assesses how much each Transformer layer contributes to emotion and speaker control, and only the layers with the highest and lowest characteristic-specific contributions are fine-tuned while the remaining layers stay frozen, achieving task decoupling and parameter-efficient adaptation. With only ~8% of parameters fine-tuned, training runs roughly 2x faster while speech quality matches or surpasses full fine-tuning, and catastrophic forgetting is significantly reduced. The paper further shows that codec language TTS models perform competitively with self-supervised models on speaker identification and emotion classification.

📝 Abstract
Recently, emotional speech generation and speaker cloning have garnered significant interest in text-to-speech (TTS). With the open-sourcing of codec language TTS models trained on massive datasets with large-scale parameters, adapting these general pre-trained TTS models to generate speech with specific emotional expressions and target speaker characteristics has become a topic of great attention. Common approaches, such as full and adapter-based fine-tuning, often overlook the specific contributions of model parameters to emotion and speaker control. Treating all parameters uniformly during fine-tuning, especially when the target data has limited content diversity compared to the pre-training corpus, results in slow training speed and an increased risk of catastrophic forgetting. To address these challenges, we propose a characteristic-specific partial fine-tuning strategy, CSP-FT for short. First, we use a weighted-sum approach to analyze the contributions of different Transformer layers in a pre-trained codec language TTS model for emotion and speaker control in the generated speech. We then selectively fine-tune the layers with the highest and lowest characteristic-specific contributions to generate speech with target emotional expression and speaker identity. Experimental results demonstrate that our method achieves performance comparable to, or even surpassing, full fine-tuning in generating speech with specific emotional expressions and speaker identities. Additionally, CSP-FT delivers approximately 2x faster training speeds, fine-tunes only around 8% of parameters, and significantly reduces catastrophic forgetting. Furthermore, we show that codec language TTS models perform competitively with self-supervised models in speaker identification and emotion classification tasks, offering valuable insights for developing universal speech processing models.
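The layer-selection step described in the abstract can be sketched as follows. This is a minimal, framework-agnostic illustration, not the authors' implementation: the function name, the score values, and the choice of `k` are all assumptions. Given per-layer contribution scores (e.g. the learned weights of a weighted-sum probe over Transformer layers), it picks the layers with the highest and lowest characteristic-specific contributions for fine-tuning and leaves the rest frozen:

```python
# Hypothetical sketch of CSP-FT's layer selection. Scores stand in for the
# weighted-sum contribution analysis described in the paper; all names and
# values here are illustrative, not from the authors' code.

def select_layers_to_finetune(contribution_scores, k=2):
    """Return sorted indices of the k highest- and k lowest-contribution layers."""
    # Rank layer indices by ascending contribution score.
    ranked = sorted(range(len(contribution_scores)),
                    key=lambda i: contribution_scores[i])
    lowest, highest = ranked[:k], ranked[-k:]
    return sorted(set(lowest + highest))

# Example: a 12-layer Transformer with made-up per-layer contribution scores.
scores = [0.02, 0.01, 0.03, 0.05, 0.10, 0.15,
          0.20, 0.18, 0.12, 0.08, 0.04, 0.02]
trainable = select_layers_to_finetune(scores, k=2)   # -> [0, 1, 6, 7]
frozen = [i for i in range(len(scores)) if i not in trainable]
```

In a real training loop the `frozen` indices would have their parameters excluded from gradient updates (e.g. by disabling gradients on those layers), so only the selected subset, around 8% of parameters in the paper's setting, is fine-tuned.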
Problem

Research questions and friction points this paper is trying to address.

Adaptive TTS
Emotional Expression
Speaker Mimicry
Innovation

Methods, ideas, or system contributions that make the work stand out.

CSP-FT
Efficient Learning
Emotion and Speaker Adaptation
Tianrui Wang
Tianjin University
Speech Signal Processing
Meng Ge
Tianjin University; CUHK-Shenzhen; National University of Singapore
Cheng Gong
Institute of Artificial Intelligence (TeleAI), China Telecom, Beijing, China
Chunyu Qiang
Kuaishou Technology; TJU; CASIA
Speech Synthesis
Haoyu Wang
Tianjin Key Laboratory of Cognitive Computing and Application, College of Intelligence and Computing, Tianjin University, Tianjin, China
Zikang Huang
Tianjin Key Laboratory of Cognitive Computing and Application, College of Intelligence and Computing, Tianjin University, Tianjin, China
Yu Jiang
Tianjin Key Laboratory of Cognitive Computing and Application, College of Intelligence and Computing, Tianjin University, Tianjin, China
Xiaobao Wang
Associate Professor, Tianjin University
Artificial Intelligence; Large Model Generation Safety; Graph Machine Learning
Xie Chen
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China
Longbiao Wang
Professor, Tianjin University
Speech Processing; Speech Recognition; Speaker Recognition; Acoustic Signal Processing; Speech Enhancement
Jianwu Dang
JAIST, Japan / Tianjin Univ., China
Speech Sciencespeech productionEEGdisorder speech