🤖 AI Summary
This study addresses the challenge of distribution shift in non-invasive magnetoencephalography (MEG) signals for speech decoding, which severely limits model generalization. Leveraging the LibriBrain phoneme classification benchmark, the authors systematically evaluate the impact of various model architectures—including residual CNNs, STFT-CNNs, CNN-Transformer hybrids, and MEGConformer—as well as data-centric strategies on robustness. Their findings indicate that instance normalization effectively mitigates distribution shifts, and that data preprocessing choices—such as label balancing and grouping strategies—play a more decisive role than model complexity. Saliency map analysis further reveals representational discrepancies across dataset splits. The best-performing custom CNN achieves a macro F1-score of 60.95%, a 21.42-point improvement over the baseline, while MEGConformer consistently attains 64.09% macro F1 on both validation and test sets.
📝 Abstract
This study investigates robust speech-related decoding from non-invasive MEG signals using the LibriBrain phoneme-classification benchmark from the 2025 PNPL competition. We compare residual convolutional neural networks (CNNs), an STFT-based CNN, and a CNN–Transformer hybrid, while also examining the effects of group averaging, label balancing, repeated grouping, normalization strategies, and data augmentation. Across our in-house implementations, preprocessing and data-configuration choices matter more than additional architectural complexity, with instance normalization emerging as the single most influential modification for generalization. The strongest of our own models, a CNN with group averaging, label balancing, repeated grouping, and instance normalization, achieves 60.95% F1-macro on the test split, compared with 39.53% for the plain CNN baseline. However, most of our models without instance normalization show substantial validation-to-test degradation, indicating that distribution shift induced by differing normalization statistics is a major obstacle to generalization in our experiments. By contrast, MEGConformer maintains 64.09% F1-macro on both validation and test, and saliency-map analysis is qualitatively consistent with this contrast: weaker models exhibit more concentrated or repetitive phoneme-sensitive patterns across splits, whereas MEGConformer's appear more distributed. Overall, the results suggest that improving the reliability of non-invasive phoneme decoding will likely require better handling of normalization-related distribution shift while also addressing the challenge of single-trial decoding.
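Since the abstract identifies instance normalization as the most influential modification, the idea can be illustrated with a minimal sketch: each MEG segment is standardized per channel using its own statistics, so models never rely on dataset-level means and variances that differ between training and test sessions. The function name, sensor count, and window length below are illustrative assumptions, not details from the paper.

```python
import numpy as np

def instance_normalize(segment, eps=1e-8):
    """Standardize each channel of one MEG segment using that segment's
    own mean and standard deviation (instance normalization).

    segment: array of shape (channels, time_samples).
    Because statistics are computed per segment, session-to-session
    shifts in overall signal scale do not affect the model's input.
    """
    mean = segment.mean(axis=-1, keepdims=True)
    std = segment.std(axis=-1, keepdims=True)
    return (segment - mean) / (std + eps)

# Illustrative example: 306 sensors, 250-sample window (assumed sizes).
rng = np.random.default_rng(0)
segment = 50.0 + 3.0 * rng.standard_normal((306, 250))  # offset/scaled data
normalized = instance_normalize(segment)
```

After normalization, every channel of every segment has approximately zero mean and unit variance regardless of the recording session it came from, which is the mechanism by which this step mitigates the validation-to-test distribution shift described above.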