🤖 AI Summary
This study addresses the challenge of predicting cortical neural responses to naturalistic movie stimuli. Methodologically, it extracts features from several pretrained models—V-JEPA2 (video), Whisper (speech), Llama 3.2 (text), InternVL3 (vision-language), and Qwen2.5-Omni (multimodal)—applied to the video, audio, and transcript of each movie; these features are linearly projected, temporally aligned to the fMRI time series, and fed into a lightweight encoder. The encoder uses a dual-head architecture—a shared group-level head coupled with subject-specific residual heads—complemented by large-scale model selection and cortical-parcellation-based ensembling to improve cross-movie generalization. On the out-of-distribution test set, the model achieves a mean Pearson correlation of 0.2085, ranking fourth in the competition; a post-hoc optimization would have raised it to second place. The results show that multimodal feature fusion and region-specific cortical modeling substantially improve neural response prediction accuracy.
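The dual-head design described above can be sketched as a simple linear read-out: a group-level weight matrix shared across subjects plus a small subject-specific residual matrix. This is a minimal illustration with made-up dimensions (`n_features`, `n_parcels`, `n_subjects` are assumptions, not values from the paper), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features = 256   # latent dim after the linear projection (assumed)
n_parcels = 1000   # cortical parcels per subject (assumed)
n_subjects = 4     # number of subjects (assumed)

# Shared group-level head: one weight matrix used for every subject.
W_shared = rng.normal(0.0, 0.01, (n_features, n_parcels))

# Subject-specific residual heads: per-subject corrections on top of the group head.
W_resid = {s: rng.normal(0.0, 0.01, (n_features, n_parcels)) for s in range(n_subjects)}

def predict(x, subject):
    """Predict parcel-wise fMRI responses for one subject.

    x: (timepoints, n_features) temporally aligned stimulus features.
    Returns: (timepoints, n_parcels) predicted responses.
    """
    return x @ W_shared + x @ W_resid[subject]

x = rng.normal(size=(10, n_features))
y_hat = predict(x, subject=0)
print(y_hat.shape)  # (10, 1000)
```

In practice both heads would be trained jointly, so the shared head captures structure common to all subjects while each residual head absorbs individual idiosyncrasies.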
📝 Abstract
We present MedARC's team solution to the Algonauts 2025 challenge. Our pipeline leveraged rich multimodal representations from state-of-the-art pretrained models spanning video (V-JEPA2), speech (Whisper), text (Llama 3.2), vision-text (InternVL3), and vision-text-audio (Qwen2.5-Omni). The extracted features were linearly projected to a latent space, temporally aligned to the fMRI time series, and finally mapped to cortical parcels through a lightweight encoder comprising a shared group head plus subject-specific residual heads. We trained hundreds of model variants across hyperparameter settings, validated them on held-out movies, and assembled ensembles targeted to each parcel in each subject. Our final submission achieved a mean Pearson correlation of 0.2085 on the test split of withheld out-of-distribution movies, placing our team fourth in the competition. We further discuss a last-minute optimization that would have raised us to second place. Our results highlight how combining features from models trained on different modalities, using a simple architecture with shared-subject and single-subject components, and conducting comprehensive model selection and ensembling improves the generalization of encoding models to novel movie stimuli. All code is available on GitHub.
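The per-parcel ensembling step—scoring many trained variants on held-out movies and assembling a final prediction parcel by parcel—can be illustrated with a simple "pick the best model per parcel by validation Pearson r" rule. This is a hedged sketch on synthetic data (the shapes, the synthetic predictions, and argmax selection are assumptions for illustration; the paper's actual ensembling may combine several variants per parcel).

```python
import numpy as np

rng = np.random.default_rng(1)

n_models, n_time, n_parcels = 5, 100, 50  # assumed toy sizes

# Held-out fMRI and predictions from several trained variants (synthetic here).
y_true = rng.normal(size=(n_time, n_parcels))
preds = rng.normal(size=(n_models, n_time, n_parcels))
preds[0] += 0.5 * y_true  # make variant 0 informative so selection is non-trivial

def pearson_per_parcel(y, y_hat):
    """Column-wise Pearson correlation between two (time, parcels) arrays."""
    y = y - y.mean(axis=0)
    y_hat = y_hat - y_hat.mean(axis=0)
    num = (y * y_hat).sum(axis=0)
    den = np.sqrt((y**2).sum(axis=0) * (y_hat**2).sum(axis=0))
    return num / den

# Score every variant on every parcel, then pick the winning variant per parcel.
scores = np.stack([pearson_per_parcel(y_true, p) for p in preds])  # (n_models, n_parcels)
best = scores.argmax(axis=0)                                       # (n_parcels,)

# Assemble the final prediction parcel by parcel from each winning variant.
ensemble = preds[best, :, np.arange(n_parcels)].T  # (n_time, n_parcels)
```

Because selection happens independently per parcel (and, in the paper's setting, per subject), the ensemble can exploit the fact that different feature sources and hyperparameters win in different cortical regions.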