VIBE: Video-Input Brain Encoder for fMRI Response Modeling

📅 2025-07-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of modeling fMRI brain responses under multimodal naturalistic stimulation (video, audio, and text). We propose a two-stage Transformer architecture: Stage I employs modality-specific foundation models (Qwen2.5, BEATs, Whisper, SlowFast, V-JEPA) to extract heterogeneous features, which are then spatiotemporally aligned and fused across modalities using rotary position encoding; Stage II uses a temporal decoder Transformer to predict voxel- or parcel-level fMRI responses. To our knowledge, this is the first framework enabling efficient spatiotemporal alignment and joint modeling of multimodal representations under naturalistic movie paradigms. Trained on 65 hours of CNeuroMod data, the model achieves a mean parcel-wise Pearson correlation of 0.3225 on the Friends S07 test set and 0.2125 across six out-of-domain films, demonstrating strong in-distribution and out-of-distribution generalization. An earlier iteration of the architecture won Phase-1 of the Algonauts 2025 Challenge and placed second overall.

📝 Abstract
We present VIBE, a two-stage Transformer that fuses multi-modal video, audio, and text features to predict fMRI activity. Representations from open-source models (Qwen2.5, BEATs, Whisper, SlowFast, V-JEPA) are merged by a modality-fusion transformer and temporally decoded by a prediction transformer with rotary embeddings. Trained on 65 hours of movie data from the CNeuroMod dataset and ensembled across 20 seeds, VIBE attains mean parcel-wise Pearson correlations of 0.3225 on in-distribution Friends S07 and 0.2125 on six out-of-distribution films. An earlier iteration of the same architecture obtained 0.3198 and 0.2096, respectively, winning Phase-1 and placing second overall in the Algonauts 2025 Challenge.
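The evaluation metric above, mean parcel-wise Pearson correlation, correlates each parcel's predicted time series with the measured one and averages across parcels. A minimal sketch follows; the function name and array shapes are illustrative assumptions, not taken from the challenge or paper code.

```python
import numpy as np

def mean_parcelwise_pearson(pred, target):
    """Mean Pearson correlation across parcels (hypothetical helper).

    pred, target: arrays of shape (time, parcels). Each column is a
    parcel's time series; correlations are averaged over parcels.
    """
    pred = pred - pred.mean(axis=0)        # center each parcel's series
    target = target - target.mean(axis=0)
    num = (pred * target).sum(axis=0)      # covariance numerator per parcel
    den = np.sqrt((pred ** 2).sum(axis=0) * (target ** 2).sum(axis=0))
    return float((num / den).mean())

# Sanity check: a perfectly linear prediction yields correlation 1.0
t = np.linspace(0, 1, 50)
sig = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)], axis=1)
print(mean_parcelwise_pearson(2 * sig + 3, sig))  # → 1.0
```

Because Pearson correlation is invariant to per-parcel scale and offset, a score of 0.3225 reflects how well the predicted response tracks the measured waveform, not its absolute amplitude.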
Problem

Research questions and friction points this paper is trying to address.

Predict fMRI activity using multi-modal video features
Fuse video, audio, and text representations for brain encoding
Improve cross-dataset generalization for fMRI response modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage Transformer for fMRI prediction
Fuses multi-modal video, audio, text
Uses rotary embeddings in decoding
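The rotary embeddings mentioned above rotate pairs of feature channels by position-dependent angles, so that relative temporal offsets between tokens are encoded directly in attention dot products. Below is a minimal NumPy sketch of the standard rotary position embedding (RoPE) formulation; the function name and half-split channel pairing are assumptions for illustration, not VIBE's actual implementation.

```python
import numpy as np

def rotary_embed(x, positions, base=10000.0):
    """Apply rotary position embeddings to a feature sequence (sketch).

    x: (seq_len, dim) with even dim; positions: (seq_len,) timestamps.
    Channel pairs (x[:, i], x[:, i + dim//2]) are rotated by angles
    position * base**(-i / (dim//2)), one frequency per pair.
    """
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    angles = np.outer(positions, freqs)         # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,  # 2-D rotation of each pair
                           x1 * sin + x2 * cos], axis=1)
```

Since each pair undergoes a pure rotation, token norms are preserved, and the dot product between two rotated tokens depends only on their position difference, which is what allows a single scheme to align features sampled at different rates across modalities.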
Daniel Carlstrom Schad
Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
Shrey Dixit
Max Planck Institute for Human Cognitive and Brain Sciences
Computational Neuroscience, Artificial Intelligence, Deep Learning
Janis Keck
Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany; Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany; Max Planck School of Cognition, Leipzig, Germany
Viktor Studenyak
Doctoral Candidate at Max Planck Institute for Human Cognitive and Brain Sciences
computational modelling, dentate gyrus, hippocampus, fMRI prediction, deep learning
Aleksandr Shpilevoi
Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
Andrej Bicanski
Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany; ScaDS.AI - Center for Scalable Data Analytics and Artificial Intelligence, Leipzig, Germany