🤖 AI Summary
This work addresses a gap in how EEG foundation models are evaluated: realistic biomedical constraints, such as limited labeled data, reduced electrode coverage, and the need for parameter-efficient adaptation, are often overlooked when evaluation relies solely on full fine-tuning on well-curated datasets. The study proposes a multi-dimensional evaluation framework tailored to practical deployment scenarios and uses it to benchmark prominent EEG foundation models (e.g., LaBraM, CSBrain, CBraMod) against supervised baselines across six datasets. Evaluations span low-resource settings, short- and long-duration tasks, and varying sensor configurations. Results show that foundation models consistently improve performance on long-context tasks such as sleep staging and mental health state classification, whereas in short-window brain–computer interface applications, lightweight supervised models achieve comparable performance with substantially fewer parameters. These findings delineate the current applicability boundaries of EEG foundation models and suggest targeted directions for future model refinement.
📝 Abstract
Evaluating foundation models under appropriate adaptation settings is essential for understanding the quality and transferability of the learned representations. Recent EEG foundation models have demonstrated promising transfer capabilities across tasks and datasets, motivating their growing use in neurotechnology and clinical applications. However, these models are typically evaluated under full fine-tuning on well-curated downstream datasets, a setting that does not reflect biomedical domain constraints such as limited labeled data, reduced sensor coverage, or the need for parameter-efficient adaptation. In this work, we propose a multi-dimensional evaluation framework for assessing EEG models under realistic low-resource conditions. Under this framework, we empirically analyze both supervised EEG models and recent EEG foundation models, including LaBraM, CSBrain, and CBraMod, across six datasets. We find that EEG foundation models consistently provide performance gains on long-context tasks such as sleep stage prediction and mental health state classification. In contrast, for short-window brain-computer interface (BCI) style tasks, supervised models achieve comparable performance despite having substantially fewer parameters. Additional analyses demonstrate that current foundation models provide limited robustness in short-window and channel-constrained settings. Together, these findings motivate the use of multi-dimensional evaluation protocols that characterize model behavior under realistic use constraints.