Linguistic Knowledge Transfer Learning for Speech Enhancement

📅 2025-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address insufficient exploitation of linguistic knowledge and the misalignment between acoustic and semantic representations in noisy speech enhancement, this paper proposes the Cross-Modality Knowledge Transfer (CMKT) framework. CMKT requires no textual input, no speech-text alignment, and no inference-time invocation of large language models (LLMs); instead, it implicitly injects semantic priors via cross-modal knowledge distillation from pretrained language model embeddings. To improve feature robustness, the authors introduce a temporal misalignment regularization strategy. Evaluated on Mandarin and English datasets, CMKT significantly improves speech intelligibility and enhancement quality across diverse noise conditions. It is architecture-agnostic, compatible with mainstream speech enhancement (SE) backbones, and supports multilingual, text-free deployment. Empirical results demonstrate consistent and substantial gains over state-of-the-art baselines.

📝 Abstract
Linguistic knowledge plays a crucial role in spoken language comprehension. It provides essential semantic and syntactic context for speech perception in noisy environments. However, most speech enhancement (SE) methods predominantly rely on acoustic features to learn the mapping relationship between noisy and clean speech, with limited exploration of linguistic integration. While text-informed SE approaches have been investigated, they often require explicit speech-text alignment or externally provided textual data, constraining their practicality in real-world scenarios. Additionally, using text as input poses challenges in aligning linguistic and acoustic representations due to their inherent differences. In this study, we propose the Cross-Modality Knowledge Transfer (CMKT) learning framework, which leverages pre-trained large language models (LLMs) to infuse linguistic knowledge into SE models without requiring text input or LLMs during inference. Furthermore, we introduce a misalignment strategy to improve knowledge transfer. This strategy applies controlled temporal shifts, encouraging the model to learn more robust representations. Experimental evaluations demonstrate that CMKT consistently outperforms baseline models across various SE architectures and LLM embeddings, highlighting its adaptability to different configurations. Additionally, results on Mandarin and English datasets confirm its effectiveness across diverse linguistic conditions, further validating its robustness. Moreover, CMKT remains effective even in scenarios without textual data, underscoring its practicality for real-world applications. By bridging the gap between linguistic and acoustic modalities, CMKT offers a scalable and innovative solution for integrating linguistic knowledge into SE models, leading to substantial improvements in both intelligibility and enhancement performance.
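The two central training-time ideas in the abstract, distilling linguistic knowledge from frozen LLM embeddings into the SE model's features and regularizing that transfer with controlled temporal shifts, can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names, the use of a cosine-distance distillation loss, and the assumption that both embedding streams are already projected to a shared frame rate and dimension are all assumptions made here.

```python
import numpy as np

def temporal_misalignment(features, max_shift=4, rng=None):
    """Controlled temporal-shift regularizer (an assumed form of the
    paper's misalignment strategy): circularly shift a (T, D) feature
    sequence by a random offset before computing the transfer loss,
    encouraging representations robust to small time misalignments."""
    if rng is None:
        rng = np.random.default_rng()
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(features, shift, axis=0)

def distillation_loss(acoustic_emb, llm_emb, eps=1e-8):
    """Cosine-distance loss (assumed objective) pulling the SE model's
    acoustic embeddings toward frame-level embeddings from a frozen,
    pretrained LLM. Both inputs: (T, D), same frame rate and dimension."""
    a = acoustic_emb / (np.linalg.norm(acoustic_emb, axis=1, keepdims=True) + eps)
    l = llm_emb / (np.linalg.norm(llm_emb, axis=1, keepdims=True) + eps)
    cos_sim = np.sum(a * l, axis=1)       # per-frame cosine similarity
    return float(np.mean(1.0 - cos_sim))  # 0 when embeddings are aligned
```

Consistent with the abstract, a loss of this kind would be added to the SE training objective only; at inference the LLM branch is dropped entirely, so the enhanced model runs text-free with no LLM in the loop.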
Problem

Research questions and friction points this paper is trying to address.

Integrates linguistic knowledge into speech enhancement without text input.
Improves speech enhancement robustness across diverse linguistic conditions.
Enhances speech intelligibility and performance in noisy environments.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages pre-trained large language models for linguistic integration.
Introduces a misalignment strategy for robust knowledge transfer.
Operates without text input or LLMs during inference.