AI Summary
This work proposes a multimodal speech enhancement framework based on a conditional diffusion model. It targets two problems: the sharp performance degradation of single-channel systems in extremely noisy environments, and the open challenge of effectively fusing bone-conducted (BC) and air-conducted (AC) signals. The noise-robust BC signal is used as a conditioning cue that guides the diffusion process alongside the AC speech signal, which the work presents as the first use of BC speech in this role. The proposed method outperforms both previously established multimodal approaches and a strong unimodal diffusion baseline across a wide range of noise conditions, supporting the claim that BC conditioning makes speech enhancement more robust.
Abstract
Single-channel speech enhancement models face significant performance degradation in extremely noisy environments. While prior work has shown that complementary bone-conducted speech can guide enhancement, effective integration of this noise-immune modality remains a challenge. This paper introduces a novel multimodal speech enhancement framework that integrates bone-conduction sensors with air-conducted microphones using a conditional diffusion model. Our proposed model significantly outperforms previously established multimodal techniques and a powerful diffusion-based single-modal baseline across a wide range of acoustic conditions.
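As a rough illustration of what conditioning a diffusion model on bone-conducted speech can look like, the sketch below stacks the bone-conducted features with the noisy air-conducted features and the current diffusion state before passing them to a denoiser. It is not the paper's implementation: the abstract does not specify the diffusion formulation, the fusion mechanism, or the network, so the DDPM-style noise-prediction objective, the channel-stacking fusion, and every name here (`make_alpha_bar`, `forward_noise`, `build_denoiser_input`, `training_loss`, and the `denoiser` callable) are assumptions made for illustration only.

```python
import numpy as np


def make_alpha_bar(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention schedule for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t from the clean target x0 at step t; also return the noise used."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps


def build_denoiser_input(x_t, noisy_ac, bc):
    """Fuse the diffusion state with both conditioning signals.

    Channel stacking is an assumed fusion choice, not taken from the paper.
    All three arrays must share the same shape, e.g. (freq, time).
    """
    return np.stack([x_t, noisy_ac, bc], axis=0)


def training_loss(denoiser, clean_ac, noisy_ac, bc, t, alpha_bar, rng):
    """Noise-prediction MSE for one example (generic DDPM-style objective)."""
    x_t, eps = forward_noise(clean_ac, t, alpha_bar, rng)
    eps_hat = denoiser(build_denoiser_input(x_t, noisy_ac, bc), t)
    return float(np.mean((eps_hat - eps) ** 2))
```

Here `denoiser` stands in for any trainable network that maps the stacked input and the step index to a noise estimate with the shape of `clean_ac`. Dropping the `bc` channel from `build_denoiser_input` would give the corresponding single-modal setup conditioned on the noisy air-conducted signal alone.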