A foundation model of vision, audition, and language for in-silico neuroscience

📅 2026-05-05

📈 Citations: 0

✨ Influential: 0

📝 Abstract

Cognitive neuroscience is fragmented into specialized models, each tailored to specific experimental paradigms, which prevents a unified model of cognition in the human brain. Here, we introduce TRIBE v2, a tri-modal (video, audio and language) foundation model capable of predicting human brain activity in a variety of naturalistic and experimental conditions. Leveraging a unified dataset of over 1,000 hours of fMRI across 720 subjects, we demonstrate that our model accurately predicts high-resolution brain responses for novel stimuli, tasks and subjects, outperforming traditional linear encoding models with several-fold improvements in accuracy. Critically, TRIBE v2 enables in silico experimentation: tested on seminal visual and neuro-linguistic paradigms, it recovers a variety of results established by decades of empirical research. Finally, by extracting interpretable latent features, TRIBE v2 reveals the fine-grained topography of multisensory integration. These results establish artificial intelligence as a unifying framework for exploring the functional organization of the human brain.
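
For readers unfamiliar with the baseline named in the abstract, the following is a minimal sketch of a linear encoding model: a ridge regression that maps stimulus features to per-voxel responses and is scored by held-out correlation. It is not the paper's pipeline. The data are synthetic, and the function names (`fit_ridge`, `voxelwise_pearson`), array sizes, and regularization strength are all assumptions chosen for illustration.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)

def voxelwise_pearson(Y_true, Y_pred):
    """Pearson correlation between measured and predicted responses, per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / denom

# Synthetic stand-in data (assumption): rows are time points,
# columns of X are stimulus features, columns of Y are voxels.
rng = np.random.default_rng(0)
n_train, n_test, n_features, n_voxels = 400, 100, 20, 50
W_true = rng.normal(size=(n_features, n_voxels))
X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))
Y_train = X_train @ W_true + rng.normal(size=(n_train, n_voxels))
Y_test = X_test @ W_true + rng.normal(size=(n_test, n_voxels))

# Fit on training data, then score predictions on held-out data.
W_hat = fit_ridge(X_train, Y_train, alpha=10.0)
scores = voxelwise_pearson(Y_test, X_test @ W_hat)
print(f"mean held-out correlation: {scores.mean():.3f}")
```

In practice the feature matrix would come from a stimulus representation and the regularization strength would be chosen by cross-validation; both are fixed here to keep the sketch short.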

Problem

Research questions and friction points this paper is trying to address.

cognitive neuroscience

foundation model

multisensory integration

brain activity prediction

unified model

Innovation

Methods, ideas, or system contributions that make the work stand out.

foundation model

multimodal integration

fMRI prediction

in silico neuroscience

interpretable latent features
