🤖 AI Summary
Existing codec-language TTS models suffer from parameter redundancy, low fine-tuning efficiency, and catastrophic forgetting when adapted for emotion expression and speaker cloning—largely because fine-tuning treats all layers uniformly, ignoring which layers actually drive each characteristic. To address this, the authors propose a characteristic-specific partial fine-tuning strategy (CSP-FT). A weighted-sum contribution analysis over Transformer layers identifies which layers are most and least responsible for emotion and speaker control; only those layers are fine-tuned, while the rest stay frozen, decoupling the two tasks and keeping adaptation parameter-efficient. With only ~8% of parameters updated, training runs roughly 2× faster while speech quality matches or surpasses full fine-tuning, and catastrophic forgetting is significantly reduced. The fine-tuned codec language model is also competitive with self-supervised models on speaker identification and emotion classification.
📝 Abstract
Recently, emotional speech generation and speaker cloning have garnered significant interest in text-to-speech (TTS). With the open-sourcing of large-parameter codec language TTS models trained on massive datasets, adapting these general pre-trained TTS models to generate speech with specific emotional expressions and target speaker characteristics has become a topic of great attention. Common approaches, such as full and adapter-based fine-tuning, often overlook the specific contributions of model parameters to emotion and speaker control. Treating all parameters uniformly during fine-tuning, especially when the target data has limited content diversity compared to the pre-training corpus, results in slow training and an increased risk of catastrophic forgetting. To address these challenges, we propose a characteristic-specific partial fine-tuning strategy, CSP-FT for short. First, we use a weighted-sum approach to analyze the contributions of different Transformer layers in a pre-trained codec language TTS model to emotion and speaker control in the generated speech. We then selectively fine-tune the layers with the highest and lowest characteristic-specific contributions to generate speech with the target emotional expression and speaker identity. Experimental results demonstrate that our method achieves performance comparable to, or even surpassing, full fine-tuning in generating speech with specific emotional expressions and speaker identities. Additionally, CSP-FT delivers approximately 2× faster training, fine-tunes only around 8% of parameters, and significantly reduces catastrophic forgetting. Furthermore, we show that codec language TTS models perform competitively with self-supervised models in speaker identification and emotion classification tasks, offering valuable insights for developing universal speech processing models.
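The layer-selection idea in the abstract can be illustrated with a minimal sketch: a weighted-sum probe learns one scalar per Transformer layer, and after softmax normalization the layers with the largest and smallest weights are the ones CSP-FT would unfreeze. The function below is a hypothetical illustration (names, `k_high`/`k_low` parameters, and the example weight values are assumptions, not taken from the paper); since softmax is monotonic, ranking by normalized weight is equivalent to ranking by raw weight.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of raw scalar weights."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_layers(raw_weights, k_high=1, k_low=1):
    """Return sorted indices of the k_high most and k_low least
    contributing Transformer layers, as scored by a weighted-sum
    probe -- the only layers a CSP-FT-style scheme would fine-tune."""
    w = softmax(raw_weights)
    order = sorted(range(len(w)), key=lambda i: w[i])  # ascending
    chosen = set(order[:k_low]) | set(order[-k_high:])
    return sorted(chosen)


# Hypothetical probe weights for a 12-layer model (illustrative values):
raw = [0.1, -0.3, 0.0, 0.5, 1.2, 0.9, 0.2, -0.1, 0.4, 2.0, 1.5, -0.8]
print(select_layers(raw, k_high=2, k_low=2))  # → [1, 9, 10, 11]
```

In a real setup the selected indices would drive which parameter groups have gradients enabled (e.g. setting `requires_grad` per layer in PyTorch), with all remaining layers frozen.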