🤖 AI Summary
Traditional SDR loss struggles to preserve phoneme intelligibility in multi-channel speech enhancement because it fails to retain the fine-grained spectral cues critical for articulation. To address this, we propose a perception-driven frequency-domain weighted SDR loss. Our method integrates dual-path weighting: a fixed component based on ANSI band-importance weights and an adaptive component dynamically estimated from spectral magnitude and the time-frequency signal-to-noise ratio, thereby emphasizing reconstruction accuracy in phonetically salient and heavily noise-corrupted regions. Integrated into the FaSNet architecture for end-to-end training, the proposed loss yields only a marginal gain in conventional SDR (+0.15 dB) but achieves a substantial 2.3 dB improvement in weighted SDR and an 8.7% increase in consonant recognition accuracy. Spectral and phoneme-level analyses confirm significantly enhanced reconstruction fidelity for transient consonants such as [ts] and [k], demonstrating better preservation of acoustic features essential for intelligibility.
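The dual-path weighting described above can be sketched as follows. This is a minimal illustrative implementation, not the paper's actual loss: the STFT parameters, the `alpha` mixing coefficient, the uniform `band_weights` placeholder (which would be ANSI band-importance values in practice), and the exact form of the adaptive SNR-style weight are all assumptions for the sake of a runnable example.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Naive Hann-windowed STFT returning a (frames, freq_bins) array."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.stack(frames), axis=-1)

def weighted_sdr_loss(est, ref, band_weights, alpha=0.5, eps=1e-8):
    """Frequency-weighted SDR-style loss in the time-frequency domain.

    band_weights : fixed per-frequency weights (stand-in for ANSI
                   band-importance values).
    alpha        : mixes the fixed and adaptive weighting paths
                   (hypothetical parameter, not from the paper).
    """
    E, R = stft(est), stft(ref)
    err = np.abs(R - E) ** 2                     # per-bin reconstruction error
    # Adaptive path: up-weight bins where the reference signal dominates
    # the residual (a simple proxy for the paper's magnitude/SNR-based
    # weighting, whose exact formula is not reproduced here).
    adapt = np.abs(R) ** 2 / (np.abs(R) ** 2 + err + eps)
    w = alpha * band_weights[None, :] + (1 - alpha) * adapt
    num = np.sum(w * np.abs(R) ** 2)             # weighted signal energy
    den = np.sum(w * err) + eps                  # weighted distortion energy
    return -10.0 * np.log10(num / den + eps)     # negative weighted SDR (dB)
```

Minimizing this loss drives the weighted signal-to-distortion ratio up, with the weights concentrating the penalty on perceptually important and noise-dominated bins.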
📝 Abstract
Recent advances in deep learning have significantly improved multichannel speech enhancement algorithms, yet conventional training loss functions such as the scale-invariant signal-to-distortion ratio (SDR) may fail to preserve fine-grained spectral cues essential for phoneme intelligibility. In this work, we propose perceptually informed variants of the SDR loss, formulated in the time-frequency domain and modulated by frequency-dependent weighting schemes. These weights are designed to emphasize time-frequency regions where speech is prominent or where the interfering noise is particularly strong. We investigate both fixed and adaptive strategies, including ANSI band-importance weights, spectral magnitude-based weighting, and dynamic weighting based on the relative amounts of speech and noise. We train the FaSNet multichannel speech enhancement model with these loss variants. Experimental results show that while standard metrics such as the SDR improve only marginally, their perceptual frequency-weighted counterparts improve more substantially. Moreover, spectral and phoneme-level analysis indicates better consonant reconstruction, pointing to better preservation of certain acoustic cues.