AI Summary
This study addresses the challenge of generating multi-speaker speech in the complete absence of audible input. To this end, the authors propose a silent speech synthesis method that fuses facial images with silent electromyography (EMG) signals: facial images are used to match the target speaker's vocal timbre, while linguistic content is extracted from the EMG signals. A key innovation is a pitch-disentangled content embedding, which separates linguistic content from pitch information. This approach improves the naturalness and expressiveness of synthesized speech in multi-speaker scenarios and constitutes the first demonstration of high-quality multi-speaker silent speech synthesis from facial images and EMG signals alone, validating the proposed pitch-disentanglement strategy.
Abstract
In this paper, we introduce a novel framework for generating multi-speaker speech without relying on any audible inputs. Our approach leverages silent electromyography (EMG) signals to capture linguistic content, while facial images are used to match the vocal identity of the target speaker. Notably, we present a pitch-disentangled content embedding that enhances the extraction of linguistic content from EMG signals. Extensive analysis demonstrates that our method can generate multi-speaker speech without any audible inputs and confirms the effectiveness of the proposed pitch-disentanglement approach.
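The abstract does not specify how the pitch-disentangled content embedding is obtained. As a purely illustrative sketch (not the authors' method), one simple way to remove pitch information from a content embedding is to regress the pitch out of each embedding dimension and keep the residual; all data and dimensions below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 8                                    # hypothetical frame count / embedding size
pitch = rng.normal(size=n)                       # per-frame pitch values (illustrative)
content = rng.normal(size=(n, d))                # pitch-free linguistic content
emb = content + np.outer(pitch, rng.normal(size=d))  # embeddings that leak pitch

def remove_pitch(emb, pitch):
    """Regress pitch out of each embedding dimension, keeping the residual."""
    p = pitch - pitch.mean()
    e = emb - emb.mean(axis=0)
    w = (p @ e) / (p @ p)            # least-squares slope per embedding dimension
    return emb - np.outer(p, w)

clean = remove_pitch(emb, pitch)
# each residual dimension is now linearly uncorrelated with pitch
corr = np.corrcoef(np.column_stack([pitch, clean]).T)[0, 1:]
```

In practice, disentanglement in neural synthesis systems is typically learned end to end (e.g. with adversarial or information-bottleneck objectives) rather than via linear regression; this sketch only conveys the goal of making the content representation carry no pitch information.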