🤖 AI Summary
Existing codec-language TTS models suffer from parameter redundancy, low fine-tuning efficiency, and catastrophic forgetting when adapted for emotion expression and speaker cloning—largely because fine-tuning treats all layers uniformly, ignoring which layers actually drive each characteristic. To address this, the authors propose a characteristic-specific partial fine-tuning strategy (CSP-FT). A weighted-sum contribution analysis over Transformer layers identifies which layers are most and least responsible for emotion and speaker control; only those layers are fine-tuned, while the rest stay frozen, decoupling the two tasks and keeping adaptation parameter-efficient. With only ~8% of parameters updated, training runs roughly 2× faster while speech quality matches or surpasses full fine-tuning, and catastrophic forgetting is significantly reduced. The fine-tuned codec language model is also competitive with self-supervised models on speaker identification and emotion classification.
📝 Abstract
Recently, emotional speech generation and speaker cloning have garnered significant interest in text-to-speech (TTS). With the open-sourcing of large-parameter codec language TTS models trained on massive datasets, adapting these general pre-trained TTS models to generate speech with specific emotional expressions and target speaker characteristics has become a topic of great attention. Common approaches, such as full and adapter-based fine-tuning, often overlook the specific contributions of model parameters to emotion and speaker control. Treating all parameters uniformly during fine-tuning, especially when the target data has limited content diversity compared to the pre-training corpus, results in slow training and an increased risk of catastrophic forgetting. To address these challenges, we propose a characteristic-specific partial fine-tuning strategy, CSP-FT for short. First, we use a weighted-sum approach to analyze the contributions of different Transformer layers in a pre-trained codec language TTS model to emotion and speaker control in the generated speech. We then selectively fine-tune the layers with the highest and lowest characteristic-specific contributions to generate speech with the target emotional expression and speaker identity. Experimental results demonstrate that our method achieves performance comparable to, or even surpassing, full fine-tuning in generating speech with specific emotional expressions and speaker identities. Additionally, CSP-FT delivers approximately 2× faster training, fine-tunes only around 8% of parameters, and significantly reduces catastrophic forgetting. Furthermore, we show that codec language TTS models perform competitively with self-supervised models in speaker identification and emotion classification tasks, offering valuable insights for developing universal speech processing models.
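The layer-selection idea in the abstract can be illustrated with a minimal sketch: a weighted-sum probe learns one scalar per Transformer layer, and after softmax normalization the layers with the largest and smallest weights are the ones CSP-FT would unfreeze. The function below is a hypothetical illustration (names, `k_high`/`k_low` parameters, and the example weight values are assumptions, not taken from the paper); since softmax is monotonic, ranking by normalized weight is equivalent to ranking by raw weight.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of raw scalar weights."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_layers(raw_weights, k_high=1, k_low=1):
    """Return sorted indices of the k_high most and k_low least
    contributing Transformer layers, as scored by a weighted-sum
    probe -- the only layers a CSP-FT-style scheme would fine-tune."""
    w = softmax(raw_weights)
    order = sorted(range(len(w)), key=lambda i: w[i])  # ascending
    chosen = set(order[:k_low]) | set(order[-k_high:])
    return sorted(chosen)


# Hypothetical probe weights for a 12-layer model (illustrative values):
raw = [0.1, -0.3, 0.0, 0.5, 1.2, 0.9, 0.2, -0.1, 0.4, 2.0, 1.5, -0.8]
print(select_layers(raw, k_high=2, k_low=2))  # → [1, 9, 10, 11]
```

In a real setup the selected indices would drive which parameter groups have gradients enabled (e.g. setting `requires_grad` per layer in PyTorch), with all remaining layers frozen.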