🤖 AI Summary
This study addresses the lack of systematic evaluation of tokenization strategies for magnetoencephalography (MEG) signals in current large-scale neuroimaging foundation models, a gap that affects both model performance and biological plausibility. The work presents the first comprehensive assessment of sample-level tokenization approaches in MEG foundation models, comparing a learnable strategy (a novel autoencoder-based tokenizer) against non-learnable alternatives across multiple criteria: signal reconstruction, token prediction accuracy, biological plausibility, preservation of individual-specific information, and downstream task performance. Experiments on three public MEG datasets demonstrate that both approaches attain high reconstruction fidelity and comparable performance across most metrics, suggesting that fixed tokenization schemes are sufficient to support the development of efficient and biologically plausible neural foundation models.
📝 Abstract
Recent success in natural language processing has motivated growing interest in large-scale foundation models for neuroimaging data. Such models often require discretization of continuous neural time series data, a process referred to as 'tokenization'. However, the impact of different tokenization strategies for neural data is currently poorly understood. In this work, we present a systematic evaluation of sample-level tokenization strategies for transformer-based large neuroimaging models (LNMs) applied to magnetoencephalography (MEG) data. We compare learnable and non-learnable tokenizers by examining their signal reconstruction fidelity and their impact on subsequent foundation modeling performance (token prediction, biological plausibility of generated data, preservation of subject-specific information, and performance on downstream tasks). For the learnable tokenizer, we introduce a novel approach based on an autoencoder. Experiments were conducted on three publicly available MEG datasets spanning different acquisition sites, scanners, and experimental paradigms. Our results show that both learnable and non-learnable discretization schemes achieve high reconstruction accuracy and broadly comparable performance across most evaluation criteria, suggesting that simple fixed sample-level tokenization strategies can be used in the development of neural foundation models. The code is available at https://github.com/OHBA-analysis/Cho2026_Tokenizer.
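To make the notion of non-learnable, sample-level tokenization concrete, below is a minimal sketch of one common fixed scheme: uniform amplitude binning, where each time sample of a standardized signal is mapped to one of K discrete tokens. The bin count, clipping range, and function names are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# Non-learnable sample-level tokenizer: each time sample is mapped to one of
# n_tokens discrete symbols by uniform amplitude binning. The token count
# (256) and clipping range (+/- 3.0) are illustrative assumptions.

def tokenize(signal: np.ndarray, n_tokens: int = 256, clip: float = 3.0) -> np.ndarray:
    """Map a standardized 1-D signal to integer tokens in [0, n_tokens - 1]."""
    x = np.clip(signal, -clip, clip)                  # bound extreme amplitudes
    edges = np.linspace(-clip, clip, n_tokens + 1)    # uniform bin edges
    return np.clip(np.digitize(x, edges) - 1, 0, n_tokens - 1)

def detokenize(tokens: np.ndarray, n_tokens: int = 256, clip: float = 3.0) -> np.ndarray:
    """Reconstruct an approximate signal by replacing each token with its bin center."""
    width = 2 * clip / n_tokens
    return -clip + (tokens + 0.5) * width

# Round trip on a toy oscillation: reconstruction error is bounded
# by half a bin width, since each sample maps to its bin's center.
t = np.linspace(0, 1, 1000)
sig = np.sin(2 * np.pi * 10 * t)
recon = detokenize(tokenize(sig))
assert np.max(np.abs(sig - recon)) <= (2 * 3.0 / 256) / 2 + 1e-9
```

A learnable tokenizer, such as the autoencoder-based approach introduced in the paper, would instead fit the discretization to the data; the scheme above is the kind of simple fixed baseline the abstract's conclusion refers to.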