Predicting Brain Responses To Natural Movies With Multimodal LLMs

📅 2025-07-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the challenge of predicting cortical responses to naturalistic movie stimuli. Methodologically, it integrates pretrained multimodal models, including V-JEPA2 (video), Whisper (speech), Llama 3.2 (text), InternVL3 (vision-language), and Qwen2.5-Omni (vision-language-audio), to extract spatiotemporal features from the video, audio, and transcript of each movie; these features are linearly projected, temporally aligned to the fMRI time series, and fed into a lightweight encoder. The encoder uses a dual-head architecture, a shared group-level head coupled with subject-specific residual heads, complemented by large-scale model selection and parcel-wise ensembling to improve cross-movie generalization. On the out-of-distribution test set, the model achieves a mean Pearson correlation of 0.2085, placing fourth in the competition; a post-hoc optimization would have raised it to second place. The results indicate that multimodal feature fusion and region-specific modeling of cortical parcels substantially improve neural response prediction accuracy.
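The shared-plus-residual encoder is the core architectural idea. Below is a minimal PyTorch sketch of how such a dual-head readout could look; the class name, layer sizes, and single-linear-layer heads are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DualHeadEncoder(nn.Module):
    """Shared group head plus per-subject residual heads (illustrative sketch)."""

    def __init__(self, feat_dim: int, n_parcels: int, n_subjects: int):
        super().__init__()
        # Group-level mapping shared by all subjects.
        self.shared_head = nn.Linear(feat_dim, n_parcels)
        # Small subject-specific corrections on top of the shared prediction.
        self.subject_heads = nn.ModuleList(
            [nn.Linear(feat_dim, n_parcels) for _ in range(n_subjects)]
        )

    def forward(self, features: torch.Tensor, subject_id: int) -> torch.Tensor:
        # features: (time, feat_dim) stimulus features aligned to fMRI TRs.
        shared = self.shared_head(features)
        residual = self.subject_heads[subject_id](features)
        return shared + residual  # (time, n_parcels) predicted responses


# Example: 4 subjects, 1000 parcels, 2048-d fused features, 300 TRs (all hypothetical).
model = DualHeadEncoder(feat_dim=2048, n_parcels=1000, n_subjects=4)
pred = model(torch.randn(300, 2048), subject_id=2)
```

Because the shared head sees data from all subjects, it can absorb the stimulus-driven structure common across the group, leaving only small per-subject corrections for the residual heads to learn.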

📝 Abstract
We present MedARC's team solution to the Algonauts 2025 challenge. Our pipeline leveraged rich multimodal representations from various state-of-the-art pretrained models across video (V-JEPA2), speech (Whisper), text (Llama 3.2), vision-text (InternVL3), and vision-text-audio (Qwen2.5-Omni). The extracted features were linearly projected to a latent space, temporally aligned to the fMRI time series, and finally mapped to cortical parcels through a lightweight encoder comprising a shared group head plus subject-specific residual heads. We trained hundreds of model variants across hyperparameter settings, validated them on held-out movies, and assembled ensembles targeted to each parcel in each subject. Our final submission achieved a mean Pearson's correlation of 0.2085 on the test split of withheld out-of-distribution movies, placing our team fourth in the competition. We further discuss a last-minute optimization that would have raised us to second place. Our results highlight how combining features from models trained on different modalities, using a simple architecture consisting of shared-subject and single-subject components, and conducting comprehensive model selection and ensembling improve generalization of encoding models to novel movie stimuli. All code is available on GitHub.
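As a rough illustration of the parcel-targeted ensembling described in the abstract, the sketch below scores each trained variant per parcel with Pearson's r on held-out movies and averages the test predictions of the best few variants for every parcel. The top-k rule, function names, and array layout are assumptions for illustration, not the authors' code.

```python
import numpy as np

def pearson_per_parcel(pred: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Pearson's r for each parcel; pred and target are (time, n_parcels)."""
    p = pred - pred.mean(axis=0)
    t = target - target.mean(axis=0)
    denom = np.sqrt((p ** 2).sum(axis=0) * (t ** 2).sum(axis=0)) + 1e-8
    return (p * t).sum(axis=0) / denom

def parcelwise_ensemble(val_preds, val_target, test_preds, k=3):
    """Average the test predictions of the k best models for each parcel.

    val_preds / test_preds: lists of (time, n_parcels) arrays, one per model.
    """
    scores = np.stack([pearson_per_parcel(p, val_target) for p in val_preds])
    test_stack = np.stack(test_preds)               # (n_models, time, n_parcels)
    out = np.zeros_like(test_preds[0])
    for parcel in range(val_target.shape[1]):
        best = np.argsort(scores[:, parcel])[-k:]   # indices of top-k models
        out[:, parcel] = test_stack[best, :, parcel].mean(axis=0)
    return out
```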
Problem

Research questions and friction points this paper is trying to address.

Predict brain responses to natural movies using multimodal LLMs
Align and map multimodal features to fMRI data (see the alignment sketch after this list)
Improve generalization of encoding models for novel stimuli
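A minimal sketch of the alignment problem from the second item above: frame-level features from a pretrained model must be resampled onto the fMRI TR grid and shifted to account for the hemodynamic delay. The fixed frame rate, per-TR averaging, and integer-TR delay used here are assumptions, not details taken from the paper.

```python
import numpy as np

def align_to_trs(frame_feats: np.ndarray, frame_rate: float,
                 tr: float, n_trs: int, hrf_delay_trs: int = 2) -> np.ndarray:
    """Average frame-level features inside each TR window, then shift by an
    assumed hemodynamic delay so features at time t predict BOLD at t + delay.

    frame_feats: (n_frames, feat_dim) features from a pretrained model.
    Returns: (n_trs, feat_dim) features aligned to the fMRI time series.
    """
    n_frames, feat_dim = frame_feats.shape
    aligned = np.zeros((n_trs, feat_dim))
    for t in range(n_trs):
        start = int(t * tr * frame_rate)
        stop = int((t + 1) * tr * frame_rate)
        window = frame_feats[start:min(stop, n_frames)]
        if len(window):
            aligned[t] = window.mean(axis=0)
    # Shift forward by the assumed delay; the first few TRs stay zero.
    shifted = np.zeros_like(aligned)
    shifted[hrf_delay_trs:] = aligned[: n_trs - hrf_delay_trs]
    return shifted
```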
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverage multimodal LLMs for brain response prediction
Linear projection and temporal alignment to fMRI (projection sketch after this list)
Shared group head with subject-specific residuals
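The linear-projection item above can be pictured as a small per-modality adapter that maps each backbone's feature width onto a common latent size before fusion. A hypothetical PyTorch sketch; the feature widths, latent size, and concatenation-based fusion are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical feature widths for each pretrained backbone (illustrative only).
MODALITY_DIMS = {"video": 1408, "speech": 1280, "text": 4096, "vision_text": 3200}

class ModalityProjector(nn.Module):
    """Project each modality's features into a shared latent space, then fuse."""

    def __init__(self, modality_dims: dict, latent_dim: int = 512):
        super().__init__()
        self.proj = nn.ModuleDict(
            {name: nn.Linear(dim, latent_dim) for name, dim in modality_dims.items()}
        )

    def forward(self, feats: dict) -> torch.Tensor:
        # feats: {modality: (time, dim)} tensors already aligned to fMRI TRs.
        latents = [self.proj[name](x) for name, x in feats.items()]
        return torch.cat(latents, dim=-1)  # (time, latent_dim * n_modalities)

fuser = ModalityProjector(MODALITY_DIMS)
fused = fuser({m: torch.randn(300, d) for m, d in MODALITY_DIMS.items()})
```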
Cesar Kadir Torrico Villanueva
Medical AI Research Center (MedARC)
Jiaxin Cindy Tu
Medical AI Research Center (MedARC), Psychological and Brain Sciences, Dartmouth College
Mihir Tripathy
Medical AI Research Center (MedARC), Core for Advanced Magnetic Resonance Imaging (CAMRI), Baylor College of Medicine
Connor Lane
Sophont
Rishab Iyer
Medical AI Research Center (MedARC)
Paul S. Scotti
Research Scientist, Princeton University
NeuroAI · Computational Cognitive Neuroscience · Neuroimaging · Open source