🤖 AI Summary
Fine-tuning speech representation models often improves downstream task performance but degrades their generalization ability. To address this trade-off, we propose Speech-FT, a fine-tuning strategy built on model merging: after fine-tuning, the updated parameters are interpolated with the original pre-trained weights in parameter space. This balances task adaptation against preservation of general representations, applies across different fine-tuning scenarios, and is compatible with various speech representation model architectures. Evaluated on diverse speech tasks, including automatic speech recognition (ASR), speech emotion recognition (SER), and keyword spotting (KWS), Speech-FT yields an average 2.3% performance gain, and in zero-shot transfer settings it retains 98.6% of the original model's generalization ability, substantially outperforming standard fine-tuning. The core idea is lightweight: rather than keeping only the fully fine-tuned parameters, Speech-FT merges them with the pre-trained weights, enabling task-specific improvement without eroding the pre-trained representations.
📝 Abstract
Speech representation models are highly effective at extracting general features for various tasks. While fine-tuning can enhance these representations for specific applications, it often compromises their generalization ability. To address this challenge, we propose Speech-FT, a fine-tuning strategy for speech representation models that leverages model merging to preserve generalization ability while still benefiting from fine-tuning. Speech-FT is effective across different fine-tuning scenarios and is compatible with various types of speech representation models, providing a versatile solution. Speech-FT offers an efficient and practical approach to further improving general speech representations after pre-training.
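The merging step described above can be illustrated with a minimal sketch of linear interpolation in parameter space between pre-trained and fine-tuned weights. The function name `merge_weights` and the coefficient `alpha` are illustrative assumptions for this sketch, not the paper's exact formulation; real models would use tensors rather than scalars.

```python
def merge_weights(pretrained, finetuned, alpha=0.5):
    """Per-parameter linear interpolation:
    (1 - alpha) * pretrained + alpha * finetuned.

    `pretrained` and `finetuned` map parameter names to values
    (scalars here for simplicity; tensors in practice).
    `alpha` (an assumed hyperparameter) controls how far the merged
    model moves toward the fine-tuned weights.
    """
    assert pretrained.keys() == finetuned.keys(), "parameter names must match"
    return {
        name: (1.0 - alpha) * pretrained[name] + alpha * finetuned[name]
        for name in pretrained
    }

# alpha = 0 recovers the pre-trained model, alpha = 1 the fully
# fine-tuned model; intermediate values trade off the two.
pre = {"w": 1.0, "b": 0.0}
ft = {"w": 3.0, "b": 2.0}
merged = merge_weights(pre, ft, alpha=0.25)
print(merged)  # {'w': 1.5, 'b': 0.5}
```

The design intuition is that staying close to the pre-trained point in parameter space preserves general-purpose features, while the interpolation toward the fine-tuned point captures task-specific gains.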