Regularizing Learnable Feature Extraction for Automatic Speech Recognition

📅 2025-06-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Learnable speech front-ends in end-to-end ASR are prone to overfitting and often underperform traditional handcrafted features (e.g., MFCCs or log-Mel spectrograms). To enhance their generalization and training stability, we systematically investigate regularization strategies: (1) We propose a novel STFT-domain masking technique that replaces standard SpecAugment, effectively mitigating its inherent spectral mismatch and gradient-blocking issues in learnable front-ends; (2) We empirically demonstrate, for the first time, that audio-level perturbations yield significantly greater performance gains for learnable front-ends than for fixed-feature pipelines; (3) We introduce a joint regularization framework that co-regularizes the neural front-end and the ASR back-end. Evaluated on major benchmarks including LibriSpeech, our approach closes the performance gap between learnable and conventional features, achieving comparable or superior recognition accuracy.

📝 Abstract
Neural front-ends are an appealing alternative to traditional, fixed feature extraction pipelines for automatic speech recognition (ASR) systems since they can be directly trained to fit the acoustic model. However, their performance often falls short compared to classical methods, which we show is largely due to their increased susceptibility to overfitting. This work therefore investigates regularization methods for training ASR models with learnable feature extraction front-ends. First, we examine audio perturbation methods and show that larger relative improvements can be obtained for learnable features. Additionally, we identify two limitations in the standard use of SpecAugment for these front-ends and propose masking in the short time Fourier transform (STFT)-domain as a simple but effective modification to address these challenges. Finally, integrating both regularization approaches effectively closes the performance gap between traditional and learnable features.
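The STFT-domain masking described in the abstract can be sketched roughly as follows. This is an illustrative assumption, not the paper's actual implementation: the function name `stft_mask` and all mask counts/widths are made up for the example. The idea is to apply SpecAugment-style time and frequency masks on the STFT that feeds the learnable front-end, rather than on log-Mel features, so the masking matches the representation the front-end consumes and does not block gradients through a separate feature pipeline.

```python
import numpy as np

def stft_mask(stft, num_freq_masks=2, freq_width=20,
              num_time_masks=2, time_width=30, rng=None):
    """SpecAugment-style masking applied directly on the STFT.

    `stft` has shape (time, freq_bins); masked regions are zeroed.
    All hyperparameters here are illustrative, not from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    masked = stft.copy()
    T, F = masked.shape
    # Frequency masks: zero out `w` consecutive STFT bins.
    for _ in range(num_freq_masks):
        w = int(rng.integers(0, freq_width + 1))
        f0 = int(rng.integers(0, max(F - w, 1)))
        masked[:, f0:f0 + w] = 0.0
    # Time masks: zero out `w` consecutive frames.
    for _ in range(num_time_masks):
        w = int(rng.integers(0, time_width + 1))
        t0 = int(rng.integers(0, max(T - w, 1)))
        masked[t0:t0 + w, :] = 0.0
    return masked
```

Because the masking happens before the learnable front-end, the front-end sees masked inputs during training and its parameters still receive gradients for the unmasked regions.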
Problem

Research questions and friction points this paper is trying to address.

Investigates regularization for learnable ASR feature extraction
Addresses overfitting in neural front-ends for speech recognition
Closes the performance gap between traditional and learnable features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Regularizing learnable feature extraction for ASR
Audio perturbation improves learnable features
STFT-domain masking addresses SpecAugment's limitations for learnable front-ends
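As a concrete instance of an audio-level perturbation (the kind the paper finds especially effective for learnable front-ends), here is a minimal speed-perturbation sketch using linear interpolation. This is a hedged illustration: the paper does not specify this implementation, and real systems typically use proper resampling (e.g., polyphase filters) rather than `np.interp`.

```python
import numpy as np

def speed_perturb(waveform, factor):
    """Resample a 1-D waveform to change playback speed.

    factor > 1 speeds up (shorter output), factor < 1 slows down.
    Linear interpolation is used only for simplicity.
    """
    n_out = int(round(len(waveform) / factor))
    old_idx = np.arange(len(waveform))
    new_idx = np.linspace(0, len(waveform) - 1, n_out)
    return np.interp(new_idx, old_idx, waveform)
```

Applying such perturbations to the raw audio changes the input the learnable front-end trains on directly, which is one plausible reason audio-level augmentation helps learnable features more than fixed-feature pipelines.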
Peter Vieting
Machine Learning and Human Language Technology Group, RWTH Aachen University, Germany
Maximilian Kannen
Machine Learning and Human Language Technology Group, RWTH Aachen University, Germany
Benedikt Hilmes
PhD Student at RWTH Aachen University
Automatic Speech Recognition, Machine Translation, Text to Speech
Ralf Schlüter
Machine Learning and Human Language Technology Group, RWTH Aachen University, Germany; AppTek GmbH, Germany
Hermann Ney
RWTH Aachen University
Machine Learning, Speech Recognition, Machine Translation, Computer Vision