Optimizing Speech Multi-View Feature Fusion through Conditional Computation

📅 2025-01-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Fusing self-supervised learning (SSL) representations, such as those from wav2vec 2.0, with handcrafted spectral features (e.g., FBanks) is inefficient in speech modeling because the two feature sources produce conflicting gradient directions during joint optimization. To address this, the paper proposes a multi-view feature fusion framework based on conditional computation. The method introduces (1) a gradient-aware gating mechanism that dynamically modulates gradient flow across heterogeneous feature sources, and (2) a multi-stage dropout strategy that mitigates update conflicts arising from disparate representation dynamics. Evaluated on the multilingual MuST-C speech translation benchmark, the approach converges significantly faster than baseline SSL-only models, matches the performance of purely spectral-feature-based systems, and improves generalization and robustness to acoustic perturbations.

📝 Abstract
Recent advancements have highlighted the efficacy of self-supervised learning (SSL) features in various speech-related tasks, providing lightweight and versatile multi-view speech representations. However, our study reveals that while SSL features expedite model convergence, they conflict with traditional spectral features like FBanks in terms of update directions. In response, we propose a novel generalized feature fusion framework grounded in conditional computation, featuring a gradient-sensitive gating network and a multi-stage dropout strategy. This framework mitigates feature conflicts and bolsters model robustness to multi-view input features. By integrating SSL and spectral features, our approach accelerates convergence and maintains performance on par with spectral models across multiple speech translation tasks on the MuST-C dataset.
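The abstract does not specify the gating network's architecture or the dropout schedule, but the general idea of conditional, gated fusion of two feature views with view-level dropout can be sketched as follows. This is a minimal NumPy illustration under stated assumptions (a scalar per-frame gate computed from the concatenated views, and a coin-flip that zeroes one entire view during training); all names and shapes are illustrative, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(ssl_feat, fbank_feat, w_gate, b_gate):
    """Fuse two feature views with a learned scalar gate per frame.

    The gate is computed from both views concatenated, so the model can
    lean on whichever view is more reliable at each frame. A trained,
    gradient-sensitive version of this gate is what the paper describes.
    """
    x = np.concatenate([ssl_feat, fbank_feat], axis=-1)  # (T, 2D)
    g = sigmoid(x @ w_gate + b_gate)                     # (T, 1), in (0, 1)
    return g * ssl_feat + (1.0 - g) * fbank_feat         # (T, D)

def view_dropout(ssl_feat, fbank_feat, p_drop=0.3, training=True):
    """With probability p_drop, zero out one whole view during training,
    forcing the model not to over-rely on either feature source."""
    if training and rng.random() < p_drop:
        if rng.random() < 0.5:
            return np.zeros_like(ssl_feat), fbank_feat
        return ssl_feat, np.zeros_like(fbank_feat)
    return ssl_feat, fbank_feat

# Toy data: T frames, D-dimensional features per view.
T, D = 50, 8
ssl = rng.standard_normal((T, D))     # stand-in for SSL (wav2vec 2.0) features
fbank = rng.standard_normal((T, D))   # stand-in for projected FBank features
w = rng.standard_normal((2 * D, 1)) * 0.1
b = np.zeros(1)

ssl_d, fbank_d = view_dropout(ssl, fbank)
fused = gated_fusion(ssl_d, fbank_d, w, b)
print(fused.shape)  # (50, 8)
```

Because the gate lies in (0, 1), each fused element is a convex combination of the two views; the actual system learns this gate jointly with the translation model so that conflicting update directions are damped rather than averaged blindly.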
Problem

Research questions and friction points this paper is trying to address.

Self-Supervised Learning
Traditional Speech Representation
Feature Integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Feature Selection
Feature Random Dropping
Speech Translation Performance
Weiqiao Shan
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Yuhao Zhang
The Chinese University of Hong Kong, Shenzhen, China
Yuchen Han
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Bei Li
Meituan LLM Team
Machine Translation · Deep Learning · Large Language Models
Xiaofeng Zhao
Huawei Translation Services Center, Beijing, China
Yuang Li
2012 Lab, Huawei
Speech · NLP
Min Zhang
Huawei Translation Services Center, Beijing, China
Hao Yang
Huawei Translation Services Center, Beijing, China
Tong Xiao
School of Computer Science and Engineering, Northeastern University, Shenyang, China; NiuTrans Research, Shenyang, China
Jingbo Zhu
Northeastern University, China
Machine Translation · Language Parsing · Natural Language Processing