A Systematic Evaluation of Sample-Level Tokenization Strategies for MEG Foundation Models

📅 2026-02-18
🤖 AI Summary
This study addresses the lack of systematic evaluation of tokenization strategies for magnetoencephalography (MEG) signals in current large-scale neuroimaging foundation models, a gap that affects both model performance and biological plausibility. The work presents the first comprehensive assessment of sample-level tokenization approaches in MEG foundation models, comparing a learnable strategy (a novel autoencoder-based tokenizer) against non-learnable ones across multiple criteria: signal reconstruction fidelity, token prediction accuracy, biological plausibility, preservation of individual-specific information, and downstream task performance. Experiments on three public MEG datasets demonstrate that both approaches achieve high reconstruction fidelity and comparable performance across most metrics, suggesting that fixed tokenization schemes are sufficient to support the development of efficient and biologically plausible neural foundation models.
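To make the "non-learnable" side of the comparison concrete, a fixed sample-level tokenizer can be as simple as amplitude binning: each continuous MEG sample is mapped to the index of a uniform amplitude bin. The sketch below is purely illustrative (the paper does not specify this exact scheme); `quantize_tokens` and `detokenize` are hypothetical names.

```python
import numpy as np

def quantize_tokens(signal, n_tokens=256):
    """Map each continuous sample to one of `n_tokens` discrete bins.

    Illustrative non-learnable tokenizer: bin edges are spaced
    uniformly between the signal's min and max amplitude.
    """
    lo, hi = signal.min(), signal.max()
    edges = np.linspace(lo, hi, n_tokens + 1)
    # np.digitize returns 1-based bin indices; shift and clip so the
    # maximum sample still lands in the last valid bin.
    tokens = np.clip(np.digitize(signal, edges) - 1, 0, n_tokens - 1)
    return tokens, edges

def detokenize(tokens, edges):
    """Reconstruct an approximate signal from bin centres."""
    centres = (edges[:-1] + edges[1:]) / 2
    return centres[tokens]
```

With uniform bins, the per-sample reconstruction error is bounded by half a bin width, which is one way such a fixed scheme can reach the high reconstruction fidelity reported above.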

📝 Abstract
Recent success in natural language processing has motivated growing interest in large-scale foundation models for neuroimaging data. Such models often require discretization of continuous neural time series data, a process referred to as 'tokenization'. However, the impact of different tokenization strategies for neural data is currently poorly understood. In this work, we present a systematic evaluation of sample-level tokenization strategies for transformer-based large neuroimaging models (LNMs) applied to magnetoencephalography (MEG) data. We compare learnable and non-learnable tokenizers by examining their signal reconstruction fidelity and their impact on subsequent foundation modeling performance (token prediction, biological plausibility of generated data, preservation of subject-specific information, and performance on downstream tasks). For the learnable tokenizer, we introduce a novel approach based on an autoencoder. Experiments were conducted on three publicly available MEG datasets spanning different acquisition sites, scanners, and experimental paradigms. Our results show that both learnable and non-learnable discretization schemes achieve high reconstruction accuracy and broadly comparable performance across most evaluation criteria, suggesting that simple fixed sample-level tokenization strategies can be used in the development of neural foundation models. The code is available at https://github.com/OHBA-analysis/Cho2026_Tokenizer.
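For contrast with the fixed schemes, the learnable tokenizer the abstract mentions is autoencoder-based. The details are in the paper and code repository; the sketch below only illustrates the general shape of such a design (project each sample into a latent space, snap to the nearest entry of a learned codebook, use the codebook index as the token). The class, its dimensions, and the untrained random weights are all assumptions for illustration; real training (reconstruction loss, codebook updates) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

class TinyAETokenizer:
    """Hypothetical sketch of an autoencoder-style sample-level tokenizer.

    A per-sample 'encoder' projects each scalar into a small latent
    space; the latent is snapped to its nearest codebook entry, and
    that entry's index is the discrete token. A matching 'decoder'
    maps codebook vectors back to samples.
    """

    def __init__(self, n_tokens=128, latent_dim=4):
        self.w_enc = rng.normal(size=(1, latent_dim))
        self.w_dec = rng.normal(size=(latent_dim, 1))
        self.codebook = rng.normal(size=(n_tokens, latent_dim))

    def encode(self, signal):
        z = signal[:, None] @ self.w_enc                      # (T, latent_dim)
        # Squared distance from each latent to every codebook entry.
        d = ((z[:, None, :] - self.codebook[None]) ** 2).sum(-1)
        return d.argmin(axis=1)                               # one token per sample

    def decode(self, tokens):
        return (self.codebook[tokens] @ self.w_dec).ravel()
```

The paper's finding is that, after training, such a learnable scheme and the simple fixed schemes perform broadly comparably, which is what motivates recommending fixed sample-level tokenization.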
Problem

Research questions and friction points this paper is trying to address.

tokenization
MEG
foundation models
neuroimaging
discretization
Innovation

Methods, ideas, or system contributions that make the work stand out.

tokenization
foundation models
MEG
autoencoder
neuroimaging
SungJun Cho
Oxford Centre for Human Brain Activity, University of Oxford, Oxford OX3 7JX, U.K.; also with the Nuffield Department of Clinical Neurosciences, University of Oxford, Oxford OX3 9DU, U.K.
Chetan Gohil
University of Sydney
Computational Neuroscience, Accelerator Physics
Rukuang Huang
Oxford Centre for Human Brain Activity, University of Oxford, Oxford OX3 7JX, U.K.; also with the Department of Psychiatry, University of Oxford, Oxford OX3 7JX, U.K.
Oiwi Parker Jones
Applied Artificial Intelligence and Clinical Neurosciences, University of Oxford
AI, Neuroscience, Deep Learning, Speech Recognition, Language Documentation
Mark W. Woolrich
Oxford Centre for Human Brain Activity, University of Oxford, Oxford OX3 7JX, U.K.; also with the Department of Psychiatry, University of Oxford, Oxford OX3 7JX, U.K.