ctPuLSE: Close-Talk, and Pseudo-Label Based Far-Field, Speech Enhancement

📅 2024-07-28

🏛️ arXiv.org

📈 Citations: 2

✨ Influential: 1

🤖 AI Summary

Neural speech enhancement models trained on simulated data often generalize poorly to real-recorded far-field mixtures. To address this, the paper proposes a pseudo-label training approach guided by close-talk recordings. Assuming a training set of real-recorded, paired close-talk and far-field mixtures, it first trains an enhancement model on simulated mixtures and applies it to the real close-talk mixtures. The estimated close-talk speech then serves as a pseudo-label that supervises far-field enhancement models trained directly on the paired real far-field mixtures. The approach therefore does not require ground-truth clean speech for the real far-field mixtures, although simulated data is still used to train the close-talk enhancer. Evaluation on the CHiME-4 dataset shows that ctPuLSE derives high-quality pseudo-labels and yields far-field enhancement models that generalize strongly to real data.

📝 Abstract

The current dominant approach for neural speech enhancement is purely supervised deep learning on simulated pairs of far-field noisy-reverberant speech (i.e., mixtures) and clean speech. The trained models, however, often exhibit limited generalizability to real-recorded mixtures. To deal with this, this paper investigates training enhancement models directly on real mixtures. A major difficulty with this approach is that, since the clean speech of real mixtures is unavailable, good supervision for real mixtures is lacking. In this context, assuming that a training set consisting of real-recorded pairs of close-talk and far-field mixtures is available, we propose to address this difficulty via close-talk speech enhancement: an enhancement model is first trained on simulated mixtures to enhance real-recorded close-talk mixtures, and the estimated close-talk speech is then used as supervision (i.e., a pseudo-label) for training far-field speech enhancement models directly on the paired real-recorded far-field mixtures. We name the proposed system $\textit{ctPuLSE}$. Evaluation results on the CHiME-4 dataset show that ctPuLSE can derive high-quality pseudo-labels and yield far-field speech enhancement models with strong generalizability to real data.
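The two-stage idea in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the soft-threshold `enhance_close_talk` is a hypothetical stand-in for a close-talk enhancer pretrained on simulated mixtures, the far-field "model" is only a short linear FIR filter fitted by least squares, and the signals are synthetic sinusoids assumed to be sample-aligned across the close-talk and far-field channels. All function and variable names are assumptions made for illustration.

```python
import numpy as np

def enhance_close_talk(mix, threshold=0.05):
    # Hypothetical placeholder for an enhancer pretrained on simulated data:
    # a soft threshold that shrinks low-amplitude samples toward zero.
    return np.sign(mix) * np.maximum(np.abs(mix) - threshold, 0.0)

def causal_frames(signal, taps):
    # Row i holds [signal[i], signal[i-1], ..., signal[i-taps+1]] (zero-padded).
    padded = np.concatenate([np.zeros(taps - 1), signal])
    return np.lib.stride_tricks.sliding_window_view(padded, taps)[:, ::-1]

def train_far_field_filter(far_field_mixes, pseudo_labels, taps=8):
    # Stage 2: fit the far-field model so its output matches the pseudo-labels.
    # No clean far-field reference is used anywhere in this step.
    A = np.vstack([causal_frames(y, taps) for y in far_field_mixes])
    b = np.concatenate(pseudo_labels)
    weights, *_ = np.linalg.lstsq(A, b, rcond=None)
    return weights

def apply_filter(weights, far_field_mix):
    return causal_frames(far_field_mix, len(weights)) @ weights

# Synthetic stand-ins for paired recordings (assumed data, not CHiME-4).
rng = np.random.default_rng(0)
speech = [np.sin(0.05 * (k + 1) * np.arange(400)) for k in range(3)]
close_talk = [s + 0.02 * rng.standard_normal(s.size) for s in speech]
far_field = [0.5 * s + 0.2 * rng.standard_normal(s.size) for s in speech]

# Stage 1: derive pseudo-labels from the close-talk channel.
pseudo_labels = [enhance_close_talk(x) for x in close_talk]

# Stage 2: train on far-field mixtures with the pseudo-labels as targets.
weights = train_far_field_filter(far_field, pseudo_labels)

mse_before = np.mean([np.mean((y - p) ** 2) for y, p in zip(far_field, pseudo_labels)])
mse_after = np.mean(
    [np.mean((apply_filter(weights, y) - p) ** 2) for y, p in zip(far_field, pseudo_labels)]
)
print(f"MSE to pseudo-label before: {mse_before:.4f}, after: {mse_after:.4f}")
```

The point of the sketch is the data flow: supervision for the far-field model comes only from the enhanced close-talk channel. Real close-talk and far-field recordings are generally not sample-aligned and differ by more than a gain, so a practical system would need a far more capable model and a loss that tolerates such mismatch.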

Problem

Research questions and friction points this paper is trying to address.

Enhancing far-field speech using real mixtures without clean references

Leveraging enhanced close-talk speech as pseudo-labels for supervision

Improving model generalizability to real-recorded noisy environments

Innovation

Methods, ideas, or system contributions that make the work stand out.

Close-talk enhancement for pseudo-label generation

Pseudo-labels supervise far-field model training

Training directly on real-recorded far-field mixtures
