Robust Speech Recognition with Schrödinger Bridge-Based Speech Enhancement

📅 2025-04-06
🏛️ IEEE International Conference on Acoustics, Speech, and Signal Processing
📈 Citations: 0
Influential: 0
🤖 AI Summary
To improve the robustness of automatic speech recognition (ASR) in noisy and reverberant conditions, this paper applies a recently proposed Schrödinger bridge (SB) generative speech enhancement model as a denoising and dereverberation front end for ASR. The SB model has been shown to perform well compared to diffusion-based approaches. The authors analyze how model scaling and different sampling methods affect downstream ASR performance, and compare the model with predictive and diffusion-based baselines across different pre-trained ASR models, including Whisper and Wav2Vec 2.0. Evaluated under noisy and reverberant conditions, the SB-based enhancement reduces word error rate (WER) by approximately 40% relative to unprocessed noisy inputs and by approximately 8% relative to a similarly sized predictive approach.

📝 Abstract
In this work, we investigate the application of generative speech enhancement to improve the robustness of ASR models in noisy and reverberant conditions. We employ a recently proposed speech enhancement model based on the Schrödinger bridge, which has been shown to perform well compared to diffusion-based approaches. We analyze the impact of model scaling and different sampling methods on ASR performance. Furthermore, we compare the considered model with predictive and diffusion-based baselines and analyze the speech recognition performance when using different pre-trained ASR models. The proposed approach reduces the word error rate by approximately 40% relative to the unprocessed speech signals and by approximately 8% relative to a similarly sized predictive approach.
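Both headline figures in the abstract are relative WER reductions, which are easy to confuse with absolute percentage-point differences. The following is a minimal Python sketch of how WER and a relative reduction are commonly computed. It is not the authors' evaluation code: the function names are hypothetical, and the example WER values (0.20 and 0.12) are made up purely for illustration and are not results from the paper.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_wer, improved_wer):
    """Fraction of the baseline WER removed by the improved system."""
    return (baseline_wer - improved_wer) / baseline_wer


# Illustrative (made-up) values, not taken from the paper.
baseline, improved = 0.20, 0.12
print(f"absolute reduction: {baseline - improved:.2f}")
print(f"relative reduction: {relative_reduction(baseline, improved):.2f}")
```

With these made-up values, the same change amounts to an absolute reduction of 8 percentage points and a relative reduction of 40%, so stating which kind of reduction is meant matters when comparing systems.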
Problem

Research questions and friction points this paper is trying to address.

Improve ASR robustness in noisy environments
Compare Schrödinger bridge with diffusion methods
Reduce word error rate significantly
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Schrödinger bridge for speech enhancement
Compares with diffusion and predictive baselines
Reduces WER by approximately 40% relative to unprocessed speech
🔎 Similar Papers
Rauf Nasretdinov
NVIDIA
Roman Korostik
NVIDIA
Ante Jukić
NVIDIA
machine learning, signal processing