🤖 AI Summary
To address speech decoding for paralyzed individuals, this work pioneers cross-modal knowledge transfer from the audio self-supervised model Wav2Vec2 to electroencephalography (EEG)-based neural signal decoding. Methodologically, we replace Wav2Vec2’s original audio frontend with a learnable Brain Feature Extractor (BFE), yielding an end-to-end sequence-to-sequence model that maps neural signals directly to text. We systematically evaluate three transfer paradigms: full fine-tuning, training from scratch, and freezing the backbone. Our key contributions are: (1) the first empirical validation that Wav2Vec2’s self-supervised representations transfer to neural signals; (2) the design of the BFE module for effective modality adaptation; and (3) compelling evidence that pretraining substantially improves performance—full fine-tuning achieves a character error rate (CER) of 18.54%, outperforming the training-from-scratch and frozen-backbone baselines by 20.46 and 15.92 percentage points, respectively, and significantly surpassing the prior state of the art.
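The three transfer paradigms differ only in which parameters are loaded and which are trainable. A minimal, illustrative helper (not the paper's actual code; the paradigm names and the simple re-initialization scheme here are assumptions for illustration) might look like this in PyTorch:

```python
import torch.nn as nn

def configure_transfer(backbone: nn.Module, bfe: nn.Module, paradigm: str):
    """Illustrative sketch of the three transfer paradigms evaluated:
    'full_finetune', 'from_scratch', and 'frozen_backbone'."""
    if paradigm == "from_scratch":
        # Stand-in for not loading pretrained weights: re-initialize
        # the backbone's linear layers (a simplification for illustration).
        for m in backbone.modules():
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
    elif paradigm == "frozen_backbone":
        # Keep pretrained Wav2Vec2 weights fixed; only the BFE is trained.
        for p in backbone.parameters():
            p.requires_grad = False
    # 'full_finetune' leaves the whole backbone trainable.
    # The BFE is always trained, since it starts untrained in every paradigm.
    for p in bfe.parameters():
        p.requires_grad = True
    return backbone, bfe
```

In all three regimes the BFE itself is trained from random initialization; what varies is whether the Wav2Vec2 backbone starts from pretrained weights and whether its gradients are enabled.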
📝 Abstract
The decoding of continuously spoken speech from neuronal activity has the potential to become an important clinical solution for paralyzed patients. Deep learning Brain-Computer Interfaces (BCIs) have recently succeeded in mapping neuronal activity to text in subjects who attempted to formulate speech. However, only small BCI datasets are available. In contrast, labeled data and pre-trained models for the closely related task of speech recognition from audio are widely available. One such model is Wav2Vec2, which has been trained in a self-supervised fashion to create meaningful representations of speech audio data. In this study, we show that patterns learned by Wav2Vec2 are transferable to brain data. Specifically, we replace its audio feature extractor with an untrained Brain Feature Extractor (BFE) model. We then run three training regimes, each across 45 different BFE architectures: full fine-tuning with pre-trained Wav2Vec2 weights, training "from scratch" without pre-trained weights, and freezing a pre-trained Wav2Vec2 while training only the BFE. Across these experiments, the best run comes from full fine-tuning with pre-trained weights, achieving a Character Error Rate (CER) of 18.54% and outperforming the best from-scratch run by 20.46 and the best frozen-Wav2Vec2 run by 15.92 percentage points. These results indicate that knowledge transfer from audio speech recognition to brain decoding is possible and significantly improves brain decoding performance for the same architectures. Related source code is available at https://github.com/tfiedlerdev/Wav2Vec2ForBrain.
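Conceptually, the BFE plays the role of Wav2Vec2's convolutional audio feature encoder: it must turn multi-channel neural recordings into a downsampled feature sequence of the width the transformer backbone expects. The sketch below is a hypothetical BFE (the channel count, layer sizes, and strides are assumptions, not the 45 architectures from the paper) showing the required input/output shapes:

```python
import torch
import torch.nn as nn

class BrainFeatureExtractor(nn.Module):
    """Hypothetical BFE: maps multi-channel neural recordings
    (batch, channels, time) to a Wav2Vec2-style feature sequence
    (batch, 512, frames), mirroring the audio feature encoder's output."""
    def __init__(self, n_channels: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_channels, hidden, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) -> (batch, 512, time // 4)
        return self.net(x)

bfe = BrainFeatureExtractor(n_channels=64)
x = torch.randn(2, 64, 400)   # 2 trials, 64 recording channels, 400 samples
feats = bfe(x)
print(feats.shape)            # torch.Size([2, 512, 100])
```

In the HuggingFace Transformers implementation of Wav2Vec2, such a module could be swapped in for the model's convolutional feature encoder so the pretrained transformer layers consume brain-derived features instead of audio ones; consult the linked repository for the authors' actual architectures.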