🤖 AI Summary
To address insufficient multimodal information utilization, phoneme confusion, and artifact generation in noisy speech enhancement, this paper proposes the first diffusion-based framework integrating audio, visual, and linguistic modalities. The core innovation is a Cross-Modal Knowledge Transfer (CMKT) mechanism: during training, semantic priors from a pre-trained language model are injected into an audio-visual joint enhancement network via feature alignment and knowledge distillation; at inference, the language model is omitted entirely, enabling lightweight and efficient deployment. Experiments demonstrate substantial improvements over state-of-the-art methods across objective metrics, including PESQ (+1.22) and STOI (+3.8%), while effectively suppressing generation artifacts. Visualization analyses further confirm that linguistic knowledge is precisely integrated and functionally effective.
📝 Abstract
Speech Enhancement (SE) aims to improve the quality of noisy speech. It has been shown that additional visual cues can further improve performance. Given that speech communication involves audio, visual, and linguistic modalities, it is natural to expect another performance boost from incorporating linguistic information. However, bridging the modality gaps to efficiently incorporate linguistic information, alongside the audio and visual modalities, during knowledge transfer is a challenging task. In this paper, we propose a novel multi-modality learning framework for SE. In this framework, a state-of-the-art diffusion model backbone is used for Audio-Visual Speech Enhancement (AVSE) modeling, where the audio and visual information are captured directly by microphones and video cameras. On top of this AVSE model, the linguistic modality employs a pre-trained language model (PLM) to transfer linguistic knowledge to the audio and visual modalities through a process termed Cross-Modal Knowledge Transfer (CMKT) during AVSE model training. Once the model is trained, the linguistic knowledge is assumed to be encoded into the AVSE model's feature processing by CMKT, so the PLM is not involved at the inference stage. We carry out SE experiments to evaluate the proposed framework. Experimental results demonstrate that our proposed AVSE system significantly enhances speech quality and reduces generative artifacts, such as phonetic confusion, compared to the state-of-the-art. Moreover, our visualization results demonstrate that the Cross-Modal Knowledge Transfer method further improves the generated speech quality of our AVSE system. These findings not only suggest that diffusion model-based techniques hold promise for advancing the state-of-the-art in AVSE, but also justify the effectiveness of incorporating linguistic information to improve the performance of diffusion-based AVSE systems.
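To make the CMKT idea concrete, the sketch below shows one plausible form of the training-only feature-alignment term: the AVSE model's intermediate audio-visual features are pulled toward PLM text embeddings via a cosine-distance distillation loss, which is simply dropped at inference. The function name, the cosine-distance choice, and the weighting are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cmkt_distillation_loss(av_features, plm_embeddings, weight=0.5):
    """Hypothetical CMKT alignment loss (a sketch, not the authors' code).

    av_features:    (T, D) intermediate features from the AVSE network
    plm_embeddings: (T, D) frame-aligned embeddings from a pre-trained LM
    Returns a weighted mean cosine distance; added to the enhancement
    loss during training only, so the PLM is never needed at inference.
    """
    # L2-normalize each frame's feature vector before comparing
    av = av_features / np.linalg.norm(av_features, axis=-1, keepdims=True)
    lm = plm_embeddings / np.linalg.norm(plm_embeddings, axis=-1, keepdims=True)
    cos = np.sum(av * lm, axis=-1)            # per-frame cosine similarity
    return weight * float(np.mean(1.0 - cos))  # 0 when perfectly aligned

# Training-time usage with toy shapes (T frames, D-dim features)
rng = np.random.default_rng(0)
student = rng.standard_normal((4, 8))
teacher = student.copy()                      # perfectly aligned case
print(cmkt_distillation_loss(student, teacher))  # ≈ 0 when features match
```

In a real system the student features would typically pass through a learned projection before alignment, and this term would be summed with the diffusion denoising objective.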