🤖 AI Summary
Problem: Conventional speech enhancement evaluation relies predominantly on utterance-level metrics, overlooking acoustic variation across phoneme categories and between male and female speakers. Method: This work presents a phoneme-level analysis of multichannel speech enhancement algorithms, motivated by gender- and phoneme-specific spectral characteristics and evaluated with perceptual metrics (PESQ, STOI) and automatic speech recognition (ASR) accuracy. Contribution/Results: Experiments show that utterance-level differences between genders are minimal, but significant differences emerge at the phoneme level: the tested algorithms reduce interference more effectively and introduce fewer artifacts on female speech, particularly in plosives, fricatives, and vowels, and they score higher on perceptual and ASR metrics for female speech than for male speech. These findings expose the limits of utterance-level evaluation and motivate fine-grained evaluation grounded in phonemic and speaker-specific acoustic structure, which can inform the design and assessment of future speech enhancement systems.
📝 Abstract
Multichannel speech enhancement algorithms are essential for improving the intelligibility of speech signals in noisy environments. These algorithms are usually evaluated at the utterance level, but this approach overlooks the differences in acoustic characteristics across phoneme categories and between male and female speakers. In this paper, we investigate the impact of gender and phonetic content on speech enhancement algorithms. We motivate this approach by outlining phoneme- and gender-specific spectral features. Our experiments reveal that while utterance-level differences between genders are minimal, significant variations emerge at the phoneme level. Results show that the tested algorithms reduce interference more effectively and introduce fewer artifacts on female speech, particularly in plosives, fricatives, and vowels. The algorithms also perform better on female speech in terms of perceptual and speech recognition metrics.
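To illustrate why a pooled average can hide phoneme-level differences, the following minimal sketch groups per-segment scores first by gender alone and then by gender and phoneme category. It is not the paper's evaluation pipeline: the `SEGMENTS` records, the `mean_by` helper, and all score values are hypothetical placeholders chosen so the effect is visible, and pooling segments per gender is only a rough stand-in for a true utterance-level metric.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-segment scores: (gender, phoneme_category, score).
# The values are invented for illustration and are NOT results from the paper.
SEGMENTS = [
    ("female", "plosive", 12.0),
    ("female", "vowel", 8.0),
    ("male", "plosive", 9.0),
    ("male", "vowel", 11.0),
]


def mean_by(records, key_fn):
    """Average the score (third field) of records grouped by key_fn(record)."""
    groups = defaultdict(list)
    for rec in records:
        groups[key_fn(rec)].append(rec[2])
    return {key: mean(scores) for key, scores in groups.items()}


# Coarse view: one average per gender, pooling all phoneme categories.
by_gender = mean_by(SEGMENTS, lambda r: r[0])

# Fine-grained view: one average per (gender, phoneme category) pair.
by_gender_phoneme = mean_by(SEGMENTS, lambda r: (r[0], r[1]))

print(by_gender)
print(by_gender_phoneme)
```

With these placeholder values the two gender averages coincide, while the per-category averages differ, which is the kind of pattern a phoneme-level breakdown is meant to surface.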