🤖 AI Summary
This work addresses the challenge of building generalizable EEG decoding models, which conventional approaches struggle with due to their reliance on task-specific data. The authors propose a novel cross-modal analysis framework that requires no fine-tuning: multi-channel EEG signals are transformed into stacked waveform images, and neuroscience-informed textual prompts are constructed to leverage off-the-shelf vision-language models (VLMs) for seizure detection. To enhance inference, the method integrates retrieval-augmented in-context learning (RAICL), which dynamically selects relevant examples to guide predictions. This study presents the first integration of VLMs with RAICL for EEG analysis, achieving performance on par with or superior to traditional temporal models in seizure detection tasks, thereby demonstrating both its effectiveness and potential for clinical deployment.
📝 Abstract
Electroencephalogram (EEG) decoding is a critical component of medical diagnostics, rehabilitation engineering, and brain-computer interfaces. However, contemporary decoding methodologies remain heavily dependent on task-specific datasets to train specialized neural network architectures. Consequently, limited data availability impedes the development of generalizable large brain decoding models. In this work, we propose a paradigm shift from conventional signal-based decoding by leveraging large-scale vision-language models (VLMs) to analyze EEG waveform plots. By converting multivariate EEG signals into stacked waveform images and integrating neuroscience domain expertise into textual prompts, we demonstrate that foundation VLMs can effectively differentiate between distinct patterns of brain activity. To address the inherent non-stationarity of EEG signals, we introduce a Retrieval-Augmented In-Context Learning (RAICL) approach, which dynamically selects the most representative and relevant few-shot examples to condition the autoregressive outputs of the VLM. Experiments on EEG-based seizure detection indicate that state-of-the-art VLMs under RAICL achieve performance better than or comparable to that of traditional time-series approaches. These findings suggest a new direction in physiological signal processing that effectively bridges the modalities of vision, language, and neural activity. Furthermore, the use of off-the-shelf VLMs, without the need for retraining or downstream architecture construction, offers a readily deployable solution for clinical applications.
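The two core steps described above — rasterizing multi-channel EEG into a stacked waveform image and retrieving similar labeled segments for in-context prompting — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exact rendering parameters, feature extractor, and similarity metric are assumptions, and `eeg_to_stacked_image` / `retrieve_examples` are hypothetical helper names.

```python
import numpy as np

def eeg_to_stacked_image(eeg, band_height=40, width=None):
    """Rasterize multi-channel EEG (channels x time) into a stacked-waveform
    image: each channel is drawn as a black trace in its own horizontal band
    on a white background. Rendering details here are illustrative only."""
    n_ch, n_t = eeg.shape
    width = width or n_t
    img = np.full((n_ch * band_height, width), 255, dtype=np.uint8)
    cols = np.linspace(0, n_t - 1, width).astype(int)  # resample time axis
    for c in range(n_ch):
        x = eeg[c, cols]
        # normalize the channel into [0, 1] so it fits inside its band
        x = (x - x.min()) / (np.ptp(x) + 1e-8)
        rows = ((1.0 - x) * (band_height - 1)).astype(int) + c * band_height
        img[rows, np.arange(width)] = 0  # draw the waveform in black
    return img

def retrieve_examples(query_feat, bank_feats, k=3):
    """RAICL-style retrieval sketch: return indices of the k labeled segments
    in the example bank whose feature vectors are most cosine-similar to the
    query segment; these become the few-shot examples in the VLM prompt."""
    q = query_feat / (np.linalg.norm(query_feat) + 1e-8)
    b = bank_feats / (np.linalg.norm(bank_feats, axis=1, keepdims=True) + 1e-8)
    sims = b @ q
    return np.argsort(-sims)[:k]
```

For example, a 4-channel, 256-sample segment rendered at `band_height=40` and `width=128` yields a 160x128 image, and the retrieved indices would be used to insert the corresponding waveform images and labels into the prompt ahead of the query image.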