🤖 AI Summary
This work addresses the significant degradation of speech inversion performance caused by background noise in real-world scenarios. To that end, it proposes a multi-task joint modeling approach based on self-supervised learning (SSL) representations, which integrates speech enhancement and speech inversion at the SSL representation level within an end-to-end shared network architecture. This design enables joint optimization of noise suppression and acoustic parameter reconstruction. Under a −5 dB signal-to-noise ratio, the proposed method improves the average Pearson correlation coefficient of speech inversion over the baseline by 80.95% in babble and 38.98% in non-babble noise conditions, validating the effectiveness of the proposed mutual-benefit mechanism between the two tasks.
📝 Abstract
Recent studies demonstrate the effectiveness of Self-Supervised Learning (SSL) speech representations for Speech Inversion (SI). However, applying SI in real-world scenarios remains challenging due to the pervasive presence of background noise. We propose a unified framework that integrates Speech Enhancement (SE) and SI models through shared SSL-based speech representations. In this framework, the SSL model is trained not only to support the SE module in suppressing noise but also to produce representations that are more informative for the SI task, allowing both modules to benefit from joint training. At a signal-to-noise ratio of -5 dB, our method achieves relative improvements on the SI task over the baseline of 80.95% under babble noise and 38.98% under non-babble noise, as measured by the average Pearson product-moment correlation across all estimated parameters.
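The evaluation metric used above, the average Pearson product-moment correlation across all estimated parameters, can be sketched as follows. This is a minimal illustration on synthetic arrays; the shapes, variable names, and the number of parameter tracks are our own assumptions, not taken from the paper.

```python
import numpy as np

def mean_pearson(pred, target):
    """Average Pearson product-moment correlation across parameter tracks.

    pred, target: arrays of shape (num_params, num_frames); each row is
    one estimated parameter trajectory and its ground-truth counterpart.
    """
    corrs = []
    for p, t in zip(pred, target):
        # Pearson correlation for a single parameter track
        corrs.append(np.corrcoef(p, t)[0, 1])
    return float(np.mean(corrs))

# Synthetic example: 3 hypothetical parameter tracks, 100 frames each
rng = np.random.default_rng(0)
target = rng.standard_normal((3, 100))
pred = target + 0.5 * rng.standard_normal((3, 100))  # noisy estimates
print(mean_pearson(pred, target))
```

A relative improvement such as the 80.95% reported here would then be `(score_new - score_base) / score_base` computed on this averaged correlation.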