🤖 AI Summary
To address insufficient multimodal information utilization, phoneme confusion, and artifact generation in noisy speech enhancement, this paper proposes the first diffusion-based framework integrating audio, visual, and linguistic modalities. The core innovation is a Cross-Modal Knowledge Transfer (CMKT) mechanism: during training, semantic priors from a pre-trained language model are injected into an audio-visual joint enhancement network via feature alignment and knowledge distillation; at inference, the language model is omitted entirely, enabling lightweight and efficient deployment. Experiments demonstrate substantial improvements over state-of-the-art methods across objective metrics, including PESQ (+1.22) and STOI (+3.8%), while effectively suppressing generation artifacts. Visualization analyses further confirm that linguistic knowledge is precisely integrated and functionally effective.
📝 Abstract
Speech Enhancement (SE) aims to improve the quality of noisy speech. It has been shown that additional visual cues can further improve performance. Given that speech communication involves audio, visual, and linguistic modalities, it is natural to expect another performance boost from incorporating linguistic information. However, bridging the modality gaps to efficiently incorporate linguistic information, alongside the audio and visual modalities, during knowledge transfer is a challenging task. In this paper, we propose a novel multi-modality learning framework for SE. In this framework, a state-of-the-art diffusion model backbone is used for Audio-Visual Speech Enhancement (AVSE) modeling, where the audio and visual information are captured directly by microphones and video cameras. On top of this AVSE model, the linguistic modality employs a pre-trained language model (PLM) to transfer linguistic knowledge to the audio and visual modalities through a process termed Cross-Modal Knowledge Transfer (CMKT) during AVSE model training. Once the model is trained, the linguistic knowledge is assumed to be encoded into the AVSE model's feature processing by CMKT, so the PLM is not involved at the inference stage. We carry out SE experiments to evaluate the proposed framework. Experimental results demonstrate that our proposed AVSE system significantly enhances speech quality and reduces generative artifacts, such as phonetic confusion, compared to the state-of-the-art. Moreover, our visualization results demonstrate that the Cross-Modal Knowledge Transfer method further improves the generated speech quality of our AVSE system. These findings not only suggest that diffusion model-based techniques hold promise for advancing the state-of-the-art in AVSE, but also justify the effectiveness of incorporating linguistic information to improve the performance of diffusion-based AVSE systems.
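To make the CMKT idea concrete, the sketch below shows one plausible form of the training-only feature-alignment term: the AVSE model's intermediate audio-visual features are pulled toward PLM text embeddings via a cosine-distance distillation loss, which is simply dropped at inference. The function name, the cosine-distance choice, and the weighting are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cmkt_distillation_loss(av_features, plm_embeddings, weight=0.5):
    """Hypothetical CMKT alignment loss (a sketch, not the authors' code).

    av_features:    (T, D) intermediate features from the AVSE network
    plm_embeddings: (T, D) frame-aligned embeddings from a pre-trained LM
    Returns a weighted mean cosine distance; added to the enhancement
    loss during training only, so the PLM is never needed at inference.
    """
    # L2-normalize each frame's feature vector before comparing
    av = av_features / np.linalg.norm(av_features, axis=-1, keepdims=True)
    lm = plm_embeddings / np.linalg.norm(plm_embeddings, axis=-1, keepdims=True)
    cos = np.sum(av * lm, axis=-1)            # per-frame cosine similarity
    return weight * float(np.mean(1.0 - cos))  # 0 when perfectly aligned

# Training-time usage with toy shapes (T frames, D-dim features)
rng = np.random.default_rng(0)
student = rng.standard_normal((4, 8))
teacher = student.copy()                      # perfectly aligned case
print(cmkt_distillation_loss(student, teacher))  # ≈ 0 when features match
```

In a real system the student features would typically pass through a learned projection before alignment, and this term would be summed with the diffusion denoising objective.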