🤖 AI Summary
Fine-tuning speech representation models often improves downstream task performance but degrades their generalization ability. To address this trade-off, we propose Speech-FT, a fine-tuning strategy built on model merging: after fine-tuning, the updated parameters are interpolated with the original pre-trained weights in parameter space. This balances task adaptation against preservation of general representations, applies across different fine-tuning scenarios, and is compatible with various speech representation model architectures. Evaluated on diverse speech tasks, including automatic speech recognition (ASR), speech emotion recognition (SER), and keyword spotting (KWS), Speech-FT yields an average 2.3% performance gain, and in zero-shot transfer settings it retains 98.6% of the original model's generalization ability, substantially outperforming standard fine-tuning. The core idea is lightweight: rather than keeping only the fully fine-tuned parameters, Speech-FT merges them with the pre-trained weights, enabling task-specific improvement without eroding the pre-trained representations.
📝 Abstract
Speech representation models are highly effective at extracting general features for various tasks. While fine-tuning can enhance these representations for specific applications, it often compromises their generalization ability. To address this challenge, we propose Speech-FT, a fine-tuning strategy for speech representation models that leverages model merging to preserve generalization ability while still benefiting from fine-tuning. Speech-FT is effective across different fine-tuning scenarios and is compatible with various types of speech representation models, providing a versatile solution. Speech-FT offers an efficient and practical approach to further improving general speech representations after pre-training.
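The merging step described above can be illustrated with a minimal sketch of linear interpolation in parameter space between pre-trained and fine-tuned weights. The function name `merge_weights` and the coefficient `alpha` are illustrative assumptions for this sketch, not the paper's exact formulation; real models would use tensors rather than scalars.

```python
def merge_weights(pretrained, finetuned, alpha=0.5):
    """Per-parameter linear interpolation:
    (1 - alpha) * pretrained + alpha * finetuned.

    `pretrained` and `finetuned` map parameter names to values
    (scalars here for simplicity; tensors in practice).
    `alpha` (an assumed hyperparameter) controls how far the merged
    model moves toward the fine-tuned weights.
    """
    assert pretrained.keys() == finetuned.keys(), "parameter names must match"
    return {
        name: (1.0 - alpha) * pretrained[name] + alpha * finetuned[name]
        for name in pretrained
    }

# alpha = 0 recovers the pre-trained model, alpha = 1 the fully
# fine-tuned model; intermediate values trade off the two.
pre = {"w": 1.0, "b": 0.0}
ft = {"w": 3.0, "b": 2.0}
merged = merge_weights(pre, ft, alpha=0.25)
print(merged)  # {'w': 1.5, 'b': 0.5}
```

The design intuition is that staying close to the pre-trained point in parameter space preserves general-purpose features, while the interpolation toward the fine-tuned point captures task-specific gains.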