AI Summary
Existing NAM-to-speech methods suffer from low speech intelligibility, poor cross-speaker generalization, excessive reliance on low-quality whisper samples, and a lack of visual-modality integration. This paper proposes the first lip-motion-driven, diffusion-based NAM-to-speech framework, departing from the conventional whisper-simulation paradigm toward phoneme-level alignment modeling and multimodal reconstruction. Key contributions include: (1) MultiNAM, the first four-modality aligned dataset (7.96 hours) comprising video, text, NAM signals, and whisper recordings; and (2) a pipeline integrating self-supervised NAM representation learning, TTS-guided phoneme alignment, lip-conditioned diffusion-based speech generation, and multimodal feature fusion. Evaluated on MultiNAM, the method achieves a 32% WER reduction and a MOS of 3.8, while enabling zero-shot cross-speaker conversion. All code, models, and the MultiNAM dataset are publicly released.
Abstract
Current Non-Audible Murmur (NAM)-to-speech techniques rely on voice cloning to simulate ground-truth speech from paired whispers. However, the simulated speech often lacks intelligibility and fails to generalize well across speakers. To address this issue, we focus on learning phoneme-level alignments from paired whispers and text and employ a Text-to-Speech (TTS) system to simulate the ground truth. To reduce dependence on whispers, we learn phoneme alignments directly from NAMs, though the quality is constrained by the available training data. To further mitigate reliance on NAM/whisper data for ground-truth simulation, we propose incorporating the lip modality to infer speech and introduce a novel diffusion-based method that leverages recent advancements in lip-to-speech technology. Additionally, we release the MultiNAM dataset with over 7.96 hours of paired NAM, whisper, video, and text data from two speakers and benchmark all methods on this dataset. Speech samples and the dataset are available at https://diff-nam.github.io/DiffNAM/.
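To make the lip-conditioned diffusion idea concrete, the sketch below shows one forward-noising step and one DDPM-style reverse step on a toy mel-spectrogram frame, with a stand-in denoiser that receives a lip-motion embedding as conditioning. All shapes, the noise schedule, and the `denoiser` function are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 50                               # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def q_sample(mel, t, noise):
    """Forward process: noise a clean mel frame to step t."""
    return np.sqrt(alpha_bars[t]) * mel + np.sqrt(1.0 - alpha_bars[t]) * noise

def denoiser(x_t, t, lip_feat):
    """Stand-in for the learned noise-prediction network.

    In the real system this would be a neural network conditioned on
    lip-motion features; here it is a fixed linear function purely to
    show where the conditioning enters."""
    return 0.5 * x_t + 0.1 * lip_feat

mel = rng.standard_normal(80)        # one mel-spectrogram frame (80 bins)
lip = rng.standard_normal(80)        # aligned lip-motion embedding (assumed size)
noise = rng.standard_normal(80)

t = T - 1
x_t = q_sample(mel, t, noise)        # fully noised frame

# One reverse (denoising) step, DDPM posterior mean:
eps_hat = denoiser(x_t, t, lip)
x_prev = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
```

At inference, such a reverse step would be iterated from `t = T - 1` down to `0`, with the lip embedding supplied at every step, so the generated speech is driven by the silent video rather than by NAM or whisper audio.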