AI Summary
This study addresses the lack of systematic evaluation under unified standards for existing EEG foundation models, which hinders reliable assessment of their generalization capabilities and application potential. We establish the first comprehensive benchmark encompassing data standardization, architectural design, and self-supervised learning strategies, evaluating 12 open-source models across 13 datasets and 9 brain-computer interface paradigms under cross-subject and few-shot calibration settings. Through comparative experiments involving linear probing versus full fine-tuning and leave-one-subject-out cross-validation, we find that linear probes often fail to effectively transfer learned representations, that task-specific models trained from scratch remain competitive, and that increasing model scale does not necessarily improve generalization under current conditions. This work provides a robust evaluation framework and empirical foundation for advancing EEG foundation models.
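For concreteness, here is a minimal sketch of a leave-one-subject-out split of the kind used for cross-subject evaluation. It is illustrative only: the subject count, trial shapes, and the use of scikit-learn's LeaveOneGroupOut are assumptions, not the study's actual pipeline.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Toy stand-ins: 6 subjects, 20 EEG trials each, 8 channels x 128 samples.
rng = np.random.default_rng(0)
n_subjects, trials_per_subject = 6, 20
X = rng.standard_normal((n_subjects * trials_per_subject, 8, 128))
y = rng.integers(0, 2, size=n_subjects * trials_per_subject)
subjects = np.repeat(np.arange(n_subjects), trials_per_subject)

# Each fold holds out every trial from exactly one subject for testing.
loso = LeaveOneGroupOut()
for fold, (train_idx, test_idx) in enumerate(loso.split(X, y, groups=subjects)):
    held_out = np.unique(subjects[test_idx])[0]
    print(f"fold {fold}: train on {len(train_idx)} trials, "
          f"test on subject {held_out} ({len(test_idx)} trials)")
```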
Abstract
Electroencephalography (EEG) foundation models have recently emerged as a promising paradigm for brain-computer interfaces (BCIs), aiming to learn transferable neural representations from large-scale heterogeneous recordings. Despite rapid progress, fair and comprehensive comparisons of existing EEG foundation models are lacking, owing to inconsistent pre-training objectives, preprocessing choices, and downstream evaluation protocols. This paper fills that gap. We first review 50 representative models and organize their design choices into a unified taxonomy covering data standardization, model architectures, and self-supervised pre-training strategies. We then evaluate 12 open-source foundation models and competitive specialist baselines across 13 EEG datasets spanning nine BCI paradigms. Emphasizing real-world deployment, we consider both cross-subject generalization under a leave-one-subject-out protocol and rapid calibration under a within-subject few-shot setting. We further compare full-parameter fine-tuning with linear probing to assess the transferability of pre-trained representations, and examine the relationship between model scale and downstream performance. Our results indicate that: 1) linear probing is frequently insufficient; 2) specialist models trained from scratch remain competitive across many tasks; and 3) larger foundation models do not necessarily yield better generalization under current data regimes and training practices.
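To make the linear-probing versus full fine-tuning distinction concrete, the sketch below shows the standard formulation in PyTorch: probing freezes the pre-trained encoder and trains only a linear head, while fine-tuning updates all parameters. The ToyBackbone, embedding size, and optimizer settings are hypothetical stand-ins for a pre-trained EEG foundation model, not code from the paper.

```python
import torch
import torch.nn as nn

class ToyBackbone(nn.Module):
    """Hypothetical stand-in for a pre-trained EEG encoder:
    maps a (batch, channels, time) window to a fixed-size embedding."""
    def __init__(self, n_channels=8, emb_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_channels, emb_dim, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
        )

    def forward(self, x):
        return self.encoder(x)

def build_classifier(backbone, emb_dim=64, n_classes=2, linear_probe=True):
    # Linear probing: freeze the backbone and train only the linear head.
    # Full fine-tuning: leave every parameter trainable.
    for p in backbone.parameters():
        p.requires_grad = not linear_probe
    model = nn.Sequential(backbone, nn.Linear(emb_dim, n_classes))
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=1e-3)
    return model, optimizer

x = torch.randn(4, 8, 256)                      # (batch, channels, time)
model, opt = build_classifier(ToyBackbone(), linear_probe=True)
print(model(x).shape)                            # torch.Size([4, 2])
```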