Advancing NAM-to-Speech Conversion with Novel Methods and the MultiNAM Dataset

📅 2024-12-25
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing NAM-to-speech methods suffer from low speech intelligibility, poor cross-speaker generalization, excessive reliance on low-quality whisper samples, and lack of visual modality integration. This paper proposes the first lip-motion-driven diffusion-based NAM-to-speech framework, departing from conventional Whisper-simulation paradigms toward phoneme-level alignment modeling and multimodal reconstruction. Key contributions include: (1) introducing MultiNAMβ€”the first four-modal aligned dataset (7.96 hours) comprising video, text, NAM signals, and Whisper transcriptions; and (2) integrating self-supervised NAM representation learning, TTS-guided phoneme alignment, lip-conditioned diffusion-based speech generation, and multimodal feature fusion. Evaluated on MultiNAM, our method achieves a 32% WER reduction and a MOS of 3.8, while enabling zero-shot cross-speaker conversion. All code, models, and the MultiNAM dataset are publicly released.

Technology Category

Application Category

πŸ“ Abstract
Current Non-Audible Murmur (NAM)-to-speech techniques rely on voice cloning to simulate ground-truth speech from paired whispers. However, the simulated speech often lacks intelligibility and fails to generalize well across different speakers. To address this issue, we focus on learning phoneme-level alignments from paired whispers and text and employ a Text-to-Speech (TTS) system to simulate the ground-truth. To reduce dependence on whispers, we learn phoneme alignments directly from NAMs, though the quality is constrained by the available training data. To further mitigate reliance on NAM/whisper data for ground-truth simulation, we propose incorporating the lip modality to infer speech and introduce a novel diffusion-based method that leverages recent advancements in lip-to-speech technology. Additionally, we release the MultiNAM dataset with over 7.96 hours of paired NAM, whisper, video, and text data from two speakers and benchmark all methods on this dataset. Speech samples and the dataset are available at https://diff-nam.github.io/DiffNAM/.
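The abstract's lip-conditioned diffusion idea can be illustrated with a minimal DDPM-style sampling loop: starting from Gaussian noise, a denoiser conditioned on a lip embedding iteratively recovers a mel-spectrogram frame. This is a toy sketch only; the diffusion schedule, the linear stand-in `predict_noise`, and all dimensions are illustrative assumptions, not the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 50                               # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

mel_dim, lip_dim = 80, 32            # mel bins / lip-embedding size (assumed)
# Stand-in "denoiser" weights: a single linear map over [mel ; lip ; t].
W = rng.normal(scale=0.01, size=(mel_dim + lip_dim + 1, mel_dim))

def predict_noise(x_t, lip_emb, t):
    """Toy noise predictor conditioned on the lip embedding and timestep."""
    feat = np.concatenate([x_t, lip_emb, [t / T]])
    return feat @ W

def sample(lip_emb):
    """Reverse diffusion: denoise pure Gaussian noise into a mel frame."""
    x = rng.normal(size=mel_dim)
    for t in reversed(range(T)):
        eps = predict_noise(x, lip_emb, t)
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / np.sqrt(alphas[t])
        if t > 0:                    # add noise on all but the final step
            x = x + np.sqrt(betas[t]) * rng.normal(size=mel_dim)
    return x

mel = sample(rng.normal(size=lip_dim))
print(mel.shape)  # (80,)
```

In the paper's setting the denoiser would be a learned network and the conditioning would come from lip-motion features extracted from video; here both are reduced to random linear maps purely to show the sampling mechanics.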
Problem

Research questions and friction points this paper is trying to address.

Whisper-to-speech conversion
Speech clarity and adaptability
Visual aid dependency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Whisper-to-Speech Conversion
Lip-Reading Assisted Learning
MultiNAM Dataset
Neil Shah
CVIT, IIIT Hyderabad, India
S. Karande
TCS Research Pune, India
Vineet Gandhi
Associate Professor at IIIT Hyderabad
Creative AI · Applied Machine Learning · Multimedia