AI Summary
This study addresses the challenge of generating multi-speaker speech in the complete absence of audible input. To this end, the authors propose a silent speech synthesis method that fuses facial images with silent electromyography (EMG) signals: facial images are used to match the target speaker's vocal timbre, while linguistic content is extracted from the EMG signals. A key innovation is a pitch-disentangled content embedding, which separates linguistic content from pitch information. This approach improves the naturalness and expressiveness of synthesized speech in multi-speaker scenarios and constitutes the first demonstration of high-quality multi-speaker silent speech synthesis from facial images and EMG signals alone, validating the proposed pitch-disentanglement strategy.
Abstract
In this paper, we introduce a novel framework for generating multi-speaker speech without relying on any audible inputs. Our approach leverages silent electromyography (EMG) signals to capture linguistic content, while facial images are used to match the vocal identity of the target speaker. Notably, we present a pitch-disentangled content embedding that enhances the extraction of linguistic content from EMG signals. Extensive analysis demonstrates that our method can generate multi-speaker speech without any audible inputs and confirms the effectiveness of the proposed pitch-disentanglement approach.
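The abstract does not specify how the pitch-disentangled content embedding is obtained. As a purely illustrative sketch (not the authors' method), one simple way to remove pitch information from a content embedding is to regress the pitch out of each embedding dimension and keep the residual; all data and dimensions below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 8                                    # hypothetical frame count / embedding size
pitch = rng.normal(size=n)                       # per-frame pitch values (illustrative)
content = rng.normal(size=(n, d))                # pitch-free linguistic content
emb = content + np.outer(pitch, rng.normal(size=d))  # embeddings that leak pitch

def remove_pitch(emb, pitch):
    """Regress pitch out of each embedding dimension, keeping the residual."""
    p = pitch - pitch.mean()
    e = emb - emb.mean(axis=0)
    w = (p @ e) / (p @ p)            # least-squares slope per embedding dimension
    return emb - np.outer(p, w)

clean = remove_pitch(emb, pitch)
# each residual dimension is now linearly uncorrelated with pitch
corr = np.corrcoef(np.column_stack([pitch, clean]).T)[0, 1:]
```

In practice, disentanglement in neural synthesis systems is typically learned end to end (e.g. with adversarial or information-bottleneck objectives) rather than via linear regression; this sketch only conveys the goal of making the content representation carry no pitch information.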