🤖 AI Summary
To improve how representations from multiple speech foundation models are exploited in downstream tasks, this paper proposes a unified framework that combines model-level and layer-level fusion. The core contribution is a lightweight cross-model interface module that jointly aligns and optimizes, end to end, the features of multiple upstream speech models—whether self-supervised or supervised—together with their internal layer-wise representations. Evaluated on automatic speech recognition (ASR) and paralinguistic analysis tasks, the method consistently outperforms existing fusion strategies. Scaling analyses across model sizes and numbers of constituent models show that the interface delivers an additional performance boost, provided the upstream models are chosen appropriately.
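The summary does not specify the interface's internal architecture, so the following PyTorch sketch is only one plausible reading: a learned softmax-weighted sum over each upstream model's layers (layer-level fusion), a projection into a shared space, and a learned weighted combination across models (model-level fusion), trained end to end while the upstreams stay frozen. Every class and parameter name here is hypothetical, and the sketch assumes the upstream models emit features at the same frame rate; in practice, features may need temporal resampling first.

```python
import torch
import torch.nn as nn

class FusionInterface(nn.Module):
    """Hypothetical cross-model interface: fuses layer-wise features
    from several frozen upstream speech models into one representation."""

    def __init__(self, num_layers_per_model, hidden_dims, fused_dim=256):
        super().__init__()
        # Learnable per-layer weights for each upstream model (layer-level fusion).
        self.layer_weights = nn.ParameterList(
            [nn.Parameter(torch.zeros(n)) for n in num_layers_per_model]
        )
        # Project each model's layer mixture into a shared space.
        self.projections = nn.ModuleList(
            [nn.Linear(d, fused_dim) for d in hidden_dims]
        )
        # Learnable weights over the upstream models (model-level fusion).
        self.model_weights = nn.Parameter(torch.zeros(len(hidden_dims)))

    def forward(self, per_model_layer_feats):
        # per_model_layer_feats: list over models; each entry is a list of
        # layer tensors shaped (batch, time, hidden_dim).
        fused_per_model = []
        for feats, w, proj in zip(per_model_layer_feats,
                                  self.layer_weights, self.projections):
            alpha = torch.softmax(w, dim=0)                   # layer mixture
            mixed = sum(a * f for a, f in zip(alpha, feats))  # weighted sum over layers
            fused_per_model.append(proj(mixed))               # align dimensions
        beta = torch.softmax(self.model_weights, dim=0)       # model mixture
        # Weighted sum over models -> (batch, time, fused_dim).
        return sum(b * f for b, f in zip(beta, fused_per_model))
```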
📝 Abstract
Speech Foundation Models have gained significant attention recently. Prior work has shown that fusing representations from multiple layers of the same model, or fusing multiple models, can improve performance on downstream tasks. We unify these two fusion strategies by proposing an interface module that enables fusion across multiple upstream speech models while integrating information across their layers. We conduct extensive experiments with different self-supervised and supervised models on various speech tasks, including ASR and paralinguistic analysis, and demonstrate that our method outperforms prior fusion approaches. We further analyze its scalability with respect to model size and model count, highlighting the importance of selecting appropriate upstream models. Our results show that the proposed interface provides an additional performance boost given a suitable selection of upstream models, making it a promising approach for utilizing Speech Foundation Models.
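As a hedged illustration of how such an interface could sit in a downstream pipeline, the snippet below feeds random stand-in layer features from two imaginary upstream models (12 and 24 layers) through the `FusionInterface` sketched above and attaches a toy linear ASR head; the shapes and vocabulary size are made up for the example.

```python
# Random stand-in features for two hypothetical upstream models:
# model A exposes 12 layers of width 768, model B 24 layers of width 1024.
layers_a = [torch.randn(2, 100, 768) for _ in range(12)]   # (batch, time, dim)
layers_b = [torch.randn(2, 100, 1024) for _ in range(24)]

interface = FusionInterface(num_layers_per_model=[12, 24],
                            hidden_dims=[768, 1024], fused_dim=256)
ctc_head = nn.Linear(256, 32)            # toy ASR head over a 32-token vocabulary

fused = interface([layers_a, layers_b])  # -> (2, 100, 256)
logits = ctc_head(fused)                 # would feed a CTC loss during training
```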