USpeech: Ultrasound-Enhanced Speech with Minimal Human Effort via Cross-Modal Synthesis

📅 2024-10-29
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing ultrasound-based speech enhancement methods are hindered by the scarcity of real ultrasound-audio paired data, as acquisition is highly susceptible to interference and requires costly manual annotation. To address this, we propose a cross-modal ultrasound synthesis framework featuring a novel two-stage paradigm bridged by audio: (1) contrastive video-audio pretraining to establish visuo-phonetic alignment, followed by (2) an audio-to-ultrasound encoder-decoder to generate high-fidelity synthetic ultrasound signals. These synthesized signals then drive a time-frequency domain enhancement network coupled with a neural vocoder for end-to-end speech restoration. Our approach overcomes the heterogeneity between video and ultrasound modalities and achieves performance on par with physically acquired data, despite using only synthetic ultrasound. It significantly outperforms prior methods in PESQ and STOI, and the synthesized ultrasound quality is comparable to real acquisitions. The complete codebase is publicly released.
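The summary names the first-stage component but not its internals. As a rough illustration, below is a minimal PyTorch sketch of the kind of symmetric InfoNCE objective commonly used for contrastive video-audio pretraining of the sort described here; the function name, temperature value, and embedding shapes are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def video_audio_infonce(video_emb: torch.Tensor,
                        audio_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired (video, audio) embeddings.

    video_emb, audio_emb: (B, D) clip-level embeddings from the two encoders.
    """
    v = F.normalize(video_emb, dim=-1)   # unit norm -> dot product = cosine sim
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature     # (B, B); entry (i, i) is the true pair
    targets = torch.arange(v.size(0), device=v.device)
    # Pull matched video/audio pairs together, push mismatched pairs apart,
    # in both retrieval directions (video->audio and audio->video).
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```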

πŸ“ Abstract
Speech enhancement is crucial for ubiquitous human-computer interaction. Recently, ultrasound-based acoustic sensing has emerged as an attractive choice for speech enhancement because of its superior ubiquity and performance. However, due to inevitable interference from unexpected and unintended sources during audio-ultrasound data acquisition, existing solutions rely heavily on human effort for data collection and processing. This leads to significant data scarcity that limits the full potential of ultrasound-based speech enhancement. To address this, we propose USpeech, a cross-modal ultrasound synthesis framework for speech enhancement with minimal human effort. At its core is a two-stage framework that establishes the correspondence between visual and ultrasonic modalities by leveraging audio as a bridge. This approach overcomes challenges from the lack of paired video-ultrasound datasets and the inherent heterogeneity between video and ultrasound data. Our framework incorporates contrastive video-audio pre-training to project both modalities into a shared semantic space and employs an audio-ultrasound encoder-decoder for ultrasound synthesis. We then present a speech enhancement network that enhances speech in the time-frequency domain and recovers the clean speech waveform via a neural vocoder. Comprehensive experiments show that USpeech, using only synthetic ultrasound data, achieves performance comparable to that with physically acquired data, outperforming state-of-the-art ultrasound-based speech enhancement baselines. USpeech is open-sourced at https://github.com/aiot-lab/USpeech/.
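The abstract's second stage, the audio-ultrasound encoder-decoder, can be pictured as a spectrogram-to-spectrogram regression. The sketch below is a minimal, hypothetical PyTorch version: the convolutional architecture, tensor shapes, and L1 training loss are all assumptions made for illustration; the paper's actual model lives in the open-source repository linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioToUltrasound(nn.Module):
    """Hypothetical stage-two model: maps an audio spectrogram (B, 1, F, T)
    to a synthetic ultrasound spectrogram of the same shape."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, kernel_size=3, padding=1),
        )

    def forward(self, audio_spec: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(audio_spec))

# One regression step against a paired real ultrasound spectrogram
# (dummy tensors stand in for a real dataloader batch).
model = AudioToUltrasound()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
audio_spec = torch.randn(4, 1, 128, 200)   # batch of audio spectrograms
real_us = torch.randn(4, 1, 128, 200)      # paired ultrasound spectrograms
loss = F.l1_loss(model(audio_spec), real_us)
loss.backward()
optimizer.step()
```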
Problem

Research questions and friction points this paper is trying to address.

Heavy reliance on human effort for ultrasound-audio data collection and processing
Lack of paired video-ultrasound datasets and inherent heterogeneity between the two modalities
Whether synthetic ultrasound data can match physically acquired data for speech enhancement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage cross-modal ultrasound synthesis framework
Contrastive video-audio pre-training for shared semantics
Neural vocoder for speech waveform recovery (see the sketch after this list)
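As a rough picture of how the enhancement stage could consume the synthetic ultrasound before handing off to the vocoder, here is a minimal, hypothetical sketch of a mask-based time-frequency enhancer in PyTorch. The GRU-plus-mask design, feature dimensions, and the HiFi-GAN mention are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TFMaskEnhancer(nn.Module):
    """Hypothetical enhancement network: predicts a sigmoid magnitude mask
    for the noisy speech spectrogram, conditioned on ultrasound features."""
    def __init__(self, freq_bins: int = 257, us_dim: int = 128,
                 hidden: int = 256):
        super().__init__()
        self.rnn = nn.GRU(freq_bins + us_dim, hidden,
                          num_layers=2, batch_first=True)
        self.mask_head = nn.Sequential(nn.Linear(hidden, freq_bins),
                                       nn.Sigmoid())

    def forward(self, noisy_mag: torch.Tensor,
                us_feat: torch.Tensor) -> torch.Tensor:
        # noisy_mag: (B, T, F) magnitudes; us_feat: (B, T, us_dim), time-aligned
        x, _ = self.rnn(torch.cat([noisy_mag, us_feat], dim=-1))
        return noisy_mag * self.mask_head(x)   # masked (enhanced) magnitudes

# Downstream, a neural vocoder (e.g. HiFi-GAN) would turn the enhanced
# spectrogram back into a clean waveform; that model is not sketched here.
```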