🤖 AI Summary
Traditional SDR loss struggles to preserve phoneme intelligibility in multi-channel speech enhancement because it fails to retain the fine-grained spectral cues critical for articulation. To address this, we propose a perception-driven frequency-domain weighted SDR loss. Our method integrates dual-path weighting: a fixed component based on ANSI band-importance weights and an adaptive component dynamically estimated from spectral magnitude and the time-frequency signal-to-noise ratio, thereby emphasizing reconstruction accuracy in phonetically salient and heavily noise-corrupted regions. Integrated into the FaSNet architecture for end-to-end training, the proposed loss yields only a marginal gain in conventional SDR (+0.15 dB) but achieves a substantial 2.3 dB improvement in weighted SDR and an 8.7% increase in consonant recognition accuracy. Spectral and phoneme-level analyses confirm significantly enhanced reconstruction fidelity for transient consonants such as [ts] and [k], demonstrating better preservation of acoustic features essential for intelligibility.
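The dual-path weighting described above can be sketched as follows. This is a minimal illustrative implementation, not the paper's actual loss: the STFT parameters, the `alpha` mixing coefficient, the uniform `band_weights` placeholder (which would be ANSI band-importance values in practice), and the exact form of the adaptive SNR-style weight are all assumptions for the sake of a runnable example.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Naive Hann-windowed STFT returning a (frames, freq_bins) array."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.stack(frames), axis=-1)

def weighted_sdr_loss(est, ref, band_weights, alpha=0.5, eps=1e-8):
    """Frequency-weighted SDR-style loss in the time-frequency domain.

    band_weights : fixed per-frequency weights (stand-in for ANSI
                   band-importance values).
    alpha        : mixes the fixed and adaptive weighting paths
                   (hypothetical parameter, not from the paper).
    """
    E, R = stft(est), stft(ref)
    err = np.abs(R - E) ** 2                     # per-bin reconstruction error
    # Adaptive path: up-weight bins where the reference signal dominates
    # the residual (a simple proxy for the paper's magnitude/SNR-based
    # weighting, whose exact formula is not reproduced here).
    adapt = np.abs(R) ** 2 / (np.abs(R) ** 2 + err + eps)
    w = alpha * band_weights[None, :] + (1 - alpha) * adapt
    num = np.sum(w * np.abs(R) ** 2)             # weighted signal energy
    den = np.sum(w * err) + eps                  # weighted distortion energy
    return -10.0 * np.log10(num / den + eps)     # negative weighted SDR (dB)
```

Minimizing this loss drives the weighted signal-to-distortion ratio up, with the weights concentrating the penalty on perceptually important and noise-dominated bins.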
📝 Abstract
Recent advances in deep learning have significantly improved multichannel speech enhancement algorithms, yet conventional training loss functions such as the scale-invariant signal-to-distortion ratio (SDR) may fail to preserve fine-grained spectral cues essential for phoneme intelligibility. In this work, we propose perceptually informed variants of the SDR loss, formulated in the time-frequency domain and modulated by frequency-dependent weighting schemes. These weights are designed to emphasize time-frequency regions where speech is prominent or where the interfering noise is particularly strong. We investigate both fixed and adaptive strategies, including ANSI band-importance weights, spectral magnitude-based weighting, and dynamic weighting based on the relative amounts of speech and noise. We train the FaSNet multichannel speech enhancement model with these loss variants. Experimental results show that while standard metrics such as the SDR improve only marginally, their perceptual frequency-weighted counterparts improve more substantially. Moreover, spectral and phoneme-level analysis indicates better consonant reconstruction, pointing to better preservation of certain acoustic cues.