Speaking Without Sound: Multi-speaker Silent Speech Voicing with Facial Inputs Only

📅 2025-04-06
🏛️ IEEE International Conference on Acoustics, Speech, and Signal Processing
📈 Citations: 3
✨ Influential: 1
🤖 AI Summary
This study addresses the challenge of generating multi-speaker speech in the complete absence of audible input. To this end, the authors propose a silent speech synthesis method that fuses facial images with silent electromyography (EMG) signals: facial images are leveraged to match the target speaker's vocal timbre, while linguistic content is extracted from EMG signals. A key innovation is the introduction of a pitch-disentangled content embedding mechanism, which effectively separates linguistic content from pitch information. This approach significantly enhances the naturalness and expressiveness of synthesized speech in multi-speaker scenarios and represents the first demonstration of high-quality multi-speaker silent speech synthesis using only facial images and EMG signals, thereby validating the efficacy of the proposed pitch-disentanglement strategy.

📝 Abstract
In this paper, we introduce a novel framework for generating multi-speaker speech without relying on any audible inputs. Our approach leverages silent electromyography (EMG) signals to capture linguistic content, while facial images are used to match the vocal identity of the target speaker. Notably, we present a pitch-disentangled content embedding that enhances the extraction of linguistic content from EMG signals. Extensive analysis demonstrates that our method can generate multi-speaker speech without any audible inputs and confirms the effectiveness of the proposed pitch-disentanglement approach.
Problem

Research questions and friction points this paper is trying to address.

silent speech
multi-speaker
facial inputs
speech generation
voice identity
Innovation

Methods, ideas, or system contributions that make the work stand out.

silent speech
multi-speaker synthesis
EMG-based speech
pitch-disentangled embedding
facial identity