🤖 AI Summary
This study addresses cross-subject prediction of fMRI brain responses to movie stimuli. We propose a multimodal representation fusion and ensemble learning framework that jointly leverages large language models (LLMs), video encoders, audio encoders, and vision-language models (VLMs). To enrich textual semantics, we incorporate movie transcripts and summaries as auxiliary inputs. Modality alignment is improved via stimulus-tuning and staged fine-tuning strategies. Predictions from individual modality-specific regressors are integrated using stacked regression. Evaluated on the Algonauts 2025 Challenge, our framework ranks 10th globally (top 12%), significantly outperforming unimodal baselines. All code and preprocessed resources are publicly released, establishing a reproducible paradigm for multimodal neural encoding and interpretable modeling of brain activity.
📝 Abstract
We present our submission to the Algonauts 2025 Challenge, where the goal is to predict fMRI brain responses to movie stimuli. Our approach integrates multimodal representations from large language models, video encoders, audio encoders, and vision-language models, combining both off-the-shelf and fine-tuned variants. To improve performance, we enriched textual inputs with detailed transcripts and summaries, and we explored stimulus-tuning and fine-tuning strategies for language and vision models. Predictions from individual models were combined using stacked regression, outperforming each unimodal baseline. Our submission, under the team name Seinfeld, ranked 10th overall. We make all code and resources publicly available, contributing to ongoing efforts in developing multimodal encoding models of brain activity.
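To make the stacked-regression step concrete, below is a minimal sketch of how per-modality predictions can be fused into a single fMRI prediction. It is an illustration under stated assumptions, not the authors' exact pipeline: the feature shapes, the choice of ridge regression, and the alpha values are all hypothetical placeholders.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

# Hypothetical inputs: one feature matrix per modality (n_TRs x n_features)
# and the target fMRI responses (n_TRs x n_parcels). Shapes are illustrative.
rng = np.random.default_rng(0)
features = {
    "text":  rng.standard_normal((600, 768)),   # e.g., LLM embeddings
    "video": rng.standard_normal((600, 512)),   # e.g., video-encoder features
    "audio": rng.standard_normal((600, 128)),   # e.g., audio-encoder features
}
fmri = rng.standard_normal((600, 1000))         # parcel-wise responses

# Level 1: fit one regressor per modality and collect out-of-fold
# predictions, so the stacker is trained on data the base models never saw.
oof_preds = [
    cross_val_predict(Ridge(alpha=1e3), X, fmri, cv=5)
    for X in features.values()
]
stacked_inputs = np.concatenate(oof_preds, axis=1)

# Level 2: a second ridge model learns how to weight each modality's
# predictions for every parcel (the "stacked regression" step).
stacker = Ridge(alpha=1.0).fit(stacked_inputs, fmri)

# At test time, base models refit on all training data produce predictions
# that are concatenated and passed through the stacker.
base_models = {m: Ridge(alpha=1e3).fit(X, fmri) for m, X in features.items()}
```

In practice, the regularization strengths would likely be tuned per parcel (e.g., with `RidgeCV`), and the base models refit on the full training set before the ensemble is applied to held-out movies.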