A foundation model of vision, audition, and language for in-silico neuroscience

📅 2026-05-05
📈 Citations: 0
Influential: 0

📝 Abstract
Cognitive neuroscience is fragmented into specialized models, each tailored to a specific experimental paradigm, thereby preventing a unified model of cognition in the human brain. Here, we introduce TRIBE v2, a tri-modal (video, audio, and language) foundation model capable of predicting human brain activity in a variety of naturalistic and experimental conditions. Leveraging a unified dataset of over 1,000 hours of fMRI across 720 subjects, we demonstrate that our model accurately predicts high-resolution brain responses for novel stimuli, tasks, and subjects, surpassing traditional linear encoding models with several-fold improvements in accuracy. Critically, TRIBE v2 enables in silico experimentation: tested on seminal visual and neuro-linguistic paradigms, it recovers a variety of results established by decades of empirical research. Finally, by extracting interpretable latent features, TRIBE v2 reveals the fine-grained topography of multisensory integration. These results establish artificial intelligence as a unifying framework for exploring the functional organization of the human brain.
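For readers unfamiliar with the baseline named in the abstract, a traditional linear encoding model can be sketched as a ridge regression from stimulus features to voxel responses, scored by per-voxel Pearson correlation on held-out data. The sketch below is illustrative only: it is not TRIBE v2 or the authors' code, the data are synthetic, and every name and parameter (`fit_ridge`, `voxelwise_pearson`, `alpha`, the array sizes) is an assumption introduced here.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y.

    X: (n_samples, n_features) stimulus features (assumed zero-mean, no intercept).
    Y: (n_samples, n_voxels) brain responses.
    """
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)

def voxelwise_pearson(Y_true, Y_pred):
    """Pearson correlation between true and predicted responses, one value per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / denom

# Synthetic stand-in data (hypothetical sizes; real fMRI has far more voxels).
rng = np.random.default_rng(0)
n_train, n_test, n_features, n_voxels = 200, 50, 16, 8
W_true = rng.normal(size=(n_features, n_voxels))
X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))
Y_train = X_train @ W_true + 0.5 * rng.normal(size=(n_train, n_voxels))
Y_test = X_test @ W_true + 0.5 * rng.normal(size=(n_test, n_voxels))

# Fit on training stimuli, then score predictions on held-out stimuli.
W_hat = fit_ridge(X_train, Y_train, alpha=10.0)
scores = voxelwise_pearson(Y_test, X_test @ W_hat)
print("mean held-out correlation:", float(scores.mean()))
```

In practice the feature matrix would come from a pretrained network or hand-built stimulus descriptors, and `alpha` would be chosen by cross-validation; both choices are outside what the abstract specifies.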
Problem

Research questions and friction points this paper is trying to address.

cognitive neuroscience
foundation model
multisensory integration
brain activity prediction
unified model
Innovation

Methods, ideas, or system contributions that make the work stand out.

foundation model
multimodal integration
fMRI prediction
in silico neuroscience
interpretable latent features