🤖 AI Summary
To improve how representations from multiple speech foundation models are exploited in downstream tasks, this paper proposes a unified framework that combines model-level and layer-level fusion. The core contribution is a lightweight cross-model interface module that jointly aligns and optimizes, end to end, the features of multiple upstream speech models—whether self-supervised or supervised—together with their internal layer-wise representations. Evaluated on automatic speech recognition (ASR) and paralinguistic analysis tasks, the method consistently outperforms existing fusion strategies. Scaling analyses across model sizes and numbers of constituent models show that the interface delivers an additional performance boost, provided the upstream models are chosen appropriately.
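The summary does not specify the interface's internal architecture, so the following PyTorch sketch is only one plausible reading: a learned softmax-weighted sum over each upstream model's layers (layer-level fusion), a projection into a shared space, and a learned weighted combination across models (model-level fusion), trained end to end while the upstreams stay frozen. Every class and parameter name here is hypothetical, and the sketch assumes the upstream models emit features at the same frame rate; in practice, features may need temporal resampling first.

```python
import torch
import torch.nn as nn

class FusionInterface(nn.Module):
    """Hypothetical cross-model interface: fuses layer-wise features
    from several frozen upstream speech models into one representation."""

    def __init__(self, num_layers_per_model, hidden_dims, fused_dim=256):
        super().__init__()
        # Learnable per-layer weights for each upstream model (layer-level fusion).
        self.layer_weights = nn.ParameterList(
            [nn.Parameter(torch.zeros(n)) for n in num_layers_per_model]
        )
        # Project each model's layer mixture into a shared space.
        self.projections = nn.ModuleList(
            [nn.Linear(d, fused_dim) for d in hidden_dims]
        )
        # Learnable weights over the upstream models (model-level fusion).
        self.model_weights = nn.Parameter(torch.zeros(len(hidden_dims)))

    def forward(self, per_model_layer_feats):
        # per_model_layer_feats: list over models; each entry is a list of
        # layer tensors shaped (batch, time, hidden_dim).
        fused_per_model = []
        for feats, w, proj in zip(per_model_layer_feats,
                                  self.layer_weights, self.projections):
            alpha = torch.softmax(w, dim=0)                   # layer mixture
            mixed = sum(a * f for a, f in zip(alpha, feats))  # weighted sum over layers
            fused_per_model.append(proj(mixed))               # align dimensions
        beta = torch.softmax(self.model_weights, dim=0)       # model mixture
        # Weighted sum over models -> (batch, time, fused_dim).
        return sum(b * f for b, f in zip(beta, fused_per_model))
```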
📝 Abstract
Speech Foundation Models have gained significant attention recently. Prior work has shown that fusing representations from multiple layers of the same model, or fusing multiple models, can improve performance on downstream tasks. We unify these two fusion strategies by proposing an interface module that enables fusion across multiple upstream speech models while integrating information across their layers. We conduct extensive experiments with different self-supervised and supervised models on various speech tasks, including ASR and paralinguistic analysis, and demonstrate that our method outperforms prior fusion approaches. We further analyze its scalability with respect to model size and model count, highlighting the importance of selecting appropriate upstream models. Our results show that the proposed interface provides an additional performance boost given a suitable selection of upstream models, making it a promising approach for utilizing Speech Foundation Models.
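As a hedged illustration of how such an interface could sit in a downstream pipeline, the snippet below feeds random stand-in layer features from two imaginary upstream models (12 and 24 layers) through the `FusionInterface` sketched above and attaches a toy linear ASR head; the shapes and vocabulary size are made up for the example.

```python
# Random stand-in features for two hypothetical upstream models:
# model A exposes 12 layers of width 768, model B 24 layers of width 1024.
layers_a = [torch.randn(2, 100, 768) for _ in range(12)]   # (batch, time, dim)
layers_b = [torch.randn(2, 100, 1024) for _ in range(24)]

interface = FusionInterface(num_layers_per_model=[12, 24],
                            hidden_dims=[768, 1024], fused_dim=256)
ctc_head = nn.Linear(256, 32)            # toy ASR head over a 32-token vocabulary

fused = interface([layers_a, layers_b])  # -> (2, 100, 256)
logits = ctc_head(fused)                 # would feed a CTC loss during training
```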