Advancing NAM-to-Speech Conversion with Novel Methods and the MultiNAM Dataset

📅 2024-12-25
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing NAM-to-speech methods suffer from low speech intelligibility, poor cross-speaker generalization, excessive reliance on low-quality whisper samples, and lack of visual modality integration. This paper proposes the first lip-motion-driven diffusion-based NAM-to-speech framework, departing from conventional Whisper-simulation paradigms toward phoneme-level alignment modeling and multimodal reconstruction. Key contributions include: (1) introducing MultiNAMβ€”the first four-modal aligned dataset (7.96 hours) comprising video, text, NAM signals, and Whisper transcriptions; and (2) integrating self-supervised NAM representation learning, TTS-guided phoneme alignment, lip-conditioned diffusion-based speech generation, and multimodal feature fusion. Evaluated on MultiNAM, our method achieves a 32% WER reduction and a MOS of 3.8, while enabling zero-shot cross-speaker conversion. All code, models, and the MultiNAM dataset are publicly released.

Technology Category

Application Category

πŸ“ Abstract
Current Non-Audible Murmur (NAM)-to-speech techniques rely on voice cloning to simulate ground-truth speech from paired whispers. However, the simulated speech often lacks intelligibility and fails to generalize well across different speakers. To address this issue, we focus on learning phoneme-level alignments from paired whispers and text and employ a Text-to-Speech (TTS) system to simulate the ground-truth. To reduce dependence on whispers, we learn phoneme alignments directly from NAMs, though the quality is constrained by the available training data. To further mitigate reliance on NAM/whisper data for ground-truth simulation, we propose incorporating the lip modality to infer speech and introduce a novel diffusion-based method that leverages recent advancements in lip-to-speech technology. Additionally, we release the MultiNAM dataset with over 7.96 hours of paired NAM, whisper, video, and text data from two speakers and benchmark all methods on this dataset. Speech samples and the dataset are available at https://diff-nam.github.io/DiffNAM/.
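The abstract's lip-conditioned diffusion idea can be illustrated with a minimal DDPM-style sampling loop: starting from Gaussian noise, a denoiser conditioned on a lip embedding iteratively recovers a mel-spectrogram frame. This is a toy sketch only; the diffusion schedule, the linear stand-in `predict_noise`, and all dimensions are illustrative assumptions, not the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 50                               # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

mel_dim, lip_dim = 80, 32            # mel bins / lip-embedding size (assumed)
# Stand-in "denoiser" weights: a single linear map over [mel ; lip ; t].
W = rng.normal(scale=0.01, size=(mel_dim + lip_dim + 1, mel_dim))

def predict_noise(x_t, lip_emb, t):
    """Toy noise predictor conditioned on the lip embedding and timestep."""
    feat = np.concatenate([x_t, lip_emb, [t / T]])
    return feat @ W

def sample(lip_emb):
    """Reverse diffusion: denoise pure Gaussian noise into a mel frame."""
    x = rng.normal(size=mel_dim)
    for t in reversed(range(T)):
        eps = predict_noise(x, lip_emb, t)
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / np.sqrt(alphas[t])
        if t > 0:                    # add noise on all but the final step
            x = x + np.sqrt(betas[t]) * rng.normal(size=mel_dim)
    return x

mel = sample(rng.normal(size=lip_dim))
print(mel.shape)  # (80,)
```

In the paper's setting the denoiser would be a learned network and the conditioning would come from lip-motion features extracted from video; here both are reduced to random linear maps purely to show the sampling mechanics.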
Problem

Research questions and friction points this paper is trying to address.

Whisper-to-speech conversion
Speech clarity and adaptability
Visual aid dependency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Whisper-to-Speech Conversion
Lip-Reading Assisted Learning
MultiNAM Dataset
Neil Shah
CVIT, IIIT Hyderabad, India
S. Karande
TCS Research Pune, India
Vineet Gandhi
Associate Professor at IIIT Hyderabad
Creative AI · Applied Machine Learning · Multimedia