SIM: Surface-based fMRI Analysis for Inter-Subject Multimodal Decoding from Movie-Watching Experiments

📅 2025-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing brain–computer interface models suffer from poor generalization to novel subjects and stimuli due to training–testing data homogeneity, while substantial inter-subject variability in cortical topology severely hinders cross-subject neural decoding. To address this, we propose Surface Vision Transformer (Surface ViT), a geometric deep learning architecture that explicitly models dynamic cortical topology, integrated with an fMRI–video–audio multimodal self-supervised contrastive alignment framework. This enables zero-shot bidirectional brain–content decoding across both unseen subjects and unseen movie clips. Evaluated on the HCP dataset (174 subjects, 7T fMRI during naturalistic movie viewing), our method achieves accurate identification of neural responses from previously unobserved subjects and reconstruction of audiovisual content from previously unobserved clips—without requiring any subject- or clip-specific labels. Attention visualization reveals pronounced individual specificity in semantic and visual cortical representations. To our knowledge, this is the first work to achieve highly generalizable, multimodally aligned, cross-subject bidirectional brain–content decoding.


📝 Abstract
Current AI frameworks for brain decoding and encoding typically train and test models within the same datasets. This limits their utility for brain–computer interfaces (BCIs) or neurofeedback, for which it would be useful to pool experiences across individuals to better simulate stimuli not sampled during training. A key obstacle to model generalisation is the degree of inter-subject variability in cortical organisation, which makes it difficult to align or compare cortical signals across participants. In this paper we address this through the use of surface vision transformers, which build a generalisable model of cortical functional dynamics by encoding the topography of cortical networks and their interactions as a moving image across a surface. This is then combined with tri-modal self-supervised contrastive (CLIP-style) alignment of audio, video, and fMRI modalities to enable the retrieval of visual and auditory stimuli from patterns of cortical activity (and vice versa). We validate our approach on 7T task-fMRI data from 174 healthy participants engaged in the movie-watching experiment from the Human Connectome Project (HCP). Results show that it is possible to detect which movie clips an individual is watching purely from their brain activity, even for individuals and movies not seen during training. Further analysis of attention maps reveals that our model captures individual patterns of brain activity that reflect semantic and visual systems. This opens the door to future personalised simulations of brain function. Code and pre-trained models will be made available at https://github.com/metrics-lab/sim; processed data for training will be available upon request at https://gin.g-node.org/Sdahan30/sim.
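To make the tri-modal contrastive alignment described above concrete, here is a minimal numpy sketch of a symmetric InfoNCE (CLIP-style) objective applied pairwise across fMRI, video, and audio embeddings. All function names, the temperature value, and the toy data are illustrative assumptions, not the authors' implementation; in the paper each modality would first pass through its own encoder (e.g. the surface vision transformer for fMRI) before this alignment step.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Project embeddings onto the unit sphere, as in CLIP.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def infonce_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of paired embeddings.

    a, b: (batch, dim) arrays whose i-th rows form positive pairs.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature              # (batch, batch) similarities
    labels = np.arange(len(a))                  # matching pairs on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Cross-entropy in both retrieval directions (a -> b and b -> a).
    return 0.5 * (xent(logits) + xent(logits.T))

def trimodal_loss(fmri_emb, video_emb, audio_emb):
    # Align fMRI with video and audio (and video with audio) pairwise.
    return (infonce_loss(fmri_emb, video_emb)
            + infonce_loss(fmri_emb, audio_emb)
            + infonce_loss(video_emb, audio_emb)) / 3.0

# Toy usage: embeddings of the same clip share underlying "content".
rng = np.random.default_rng(0)
shared = rng.normal(size=(8, 32))               # one row per movie clip
fmri  = shared + 0.1 * rng.normal(size=(8, 32))
video = shared + 0.1 * rng.normal(size=(8, 32))
audio = shared + 0.1 * rng.normal(size=(8, 32))
loss = trimodal_loss(fmri, video, audio)
```

Once the modalities are aligned in this shared space, the bidirectional retrieval in the abstract reduces to nearest-neighbour search: rank clips by cosine similarity to an fMRI embedding (decoding), or rank fMRI windows against a clip embedding (encoding).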
Problem

Research questions and friction points this paper is trying to address.

Artificial Intelligence
Brain-Computer Interface
Cross-Individual Variability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Surface Vision Transformer
Contrastive Learning
fMRI Data Analysis
Simon Dahan
Department of Biomedical Computing, King’s College London, London, UK
Gabriel Bénédict
Independent researcher
Logan Z. J. Williams
Biomedical Engineering & Imaging Science, King's College London
Neuroimaging, Neuroanatomy, Neuroradiology, Medical Image Analysis
Yourong Guo
School of Biomedical Engineering and Imaging Sciences, King’s College London
medical image analysis, computational neuroscience, neuroanatomy, cortical folding pattern
D. Rueckert
Institute for AI in Medicine, Technical University of Munich, Munich, Germany; Department of Computing, Imperial College London, London, UK; Munich Center for Machine Learning (MCML), Munich, Germany
Robert Leech
Institute of Psychiatry, Psychology & Neuroscience, King’s College London, London, UK
Emma C. Robinson
King's College London
Medical Image Computingconnectomicsregistration