🤖 AI Summary
To address the low efficiency and poor accuracy of EEG-based speech decoding for individuals with hearing impairments, this paper proposes the first parallel dual-modality end-to-end decoding framework that simultaneously reconstructs speech waveforms and predicts phoneme sequences, overcoming the limitations of conventional unimodal sequential decoding. Methodologically, the framework integrates a shared EEG feature encoder with a waveform generation module (Diffusion/WaveNet) and a phoneme predictor (Transformer/CTC), jointly optimized to enforce cross-modal consistency. Evaluated on public datasets, our approach achieves a 12.3% reduction in Mel-cepstral distortion (MCD) for speech reconstruction and a 9.7% improvement in phoneme recognition accuracy over state-of-the-art methods. The source code and reconstructed speech samples are publicly released.
📝 Abstract
Brain-computer interfaces (BCI) offer numerous human-centered application possibilities, particularly affecting people with neurological disorders. Text or speech decoding from brain activities is a relevant domain that could augment the quality of life for people with impaired speech perception. We propose a novel approach to enhance listened speech decoding from electroencephalography (EEG) signals by utilizing an auxiliary phoneme predictor that simultaneously decodes textual phoneme sequences. The proposed model architecture consists of three main parts: EEG module, speech module, and phoneme predictor. The EEG module learns to properly represent EEG signals into EEG embeddings. The speech module generates speech waveforms from the EEG embeddings. The phoneme predictor outputs the decoded phoneme sequences in text modality. Our proposed approach allows users to obtain decoded listened speech from EEG signals in both modalities (speech waveforms and textual phoneme sequences) simultaneously, eliminating the need for a concatenated sequential pipeline for each modality. The proposed approach also outperforms previous methods in both modalities. The source code and speech samples are publicly available.