🤖 AI Summary
This study investigates the prediction of neural responses to naturalistic audiovisual stimuli, focusing on the dominant role that continuous multimodal (audiovisual) inputs play in neural encoding and their impact on model generalization. Methodologically, we integrate high-fidelity visual features from X-CLIP and auditory features from Whisper, combined with attention mechanisms and both linear and deep encoding models. We find that linguistic text features do not improve prediction accuracy, whereas continuous audiovisual streams strongly dominate neural representations. Key contributions are: (1) empirical evidence that non-linguistic sensory inputs are the primary driver of neural encoding; (2) a simple linear encoding model that achieves an 18% improvement over baselines on out-of-distribution (OOD) data; and (3) a spatial fMRI analysis showing significantly improved prediction accuracy in auditory cortex, validating the efficacy of high-fidelity speech representations.
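The linear encoding approach described above can be sketched as a ridge regression from concatenated visual and auditory features to per-voxel fMRI responses, scored by held-out prediction correlation. Everything below is an illustrative stand-in, not the study's actual pipeline: the feature dimensions, the synthetic data (random matrices in place of real X-CLIP/Whisper embeddings and BOLD signals), and the ridge penalty are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_train, n_test = 600, 200      # fMRI time points (TRs); illustrative sizes
d_vis, d_aud = 512, 384         # stand-ins for X-CLIP / Whisper embedding widths
n_voxels = 1000

# Concatenate per-TR visual and auditory features into one design matrix.
X_train = rng.standard_normal((n_train, d_vis + d_aud))
X_test = rng.standard_normal((n_test, d_vis + d_aud))

# Simulate voxel responses as a noisy linear readout of the features.
W_true = 0.1 * rng.standard_normal((d_vis + d_aud, n_voxels))
Y_train = X_train @ W_true + rng.standard_normal((n_train, n_voxels))
Y_test = X_test @ W_true + rng.standard_normal((n_test, n_voxels))

# Closed-form ridge regression: W = (X'X + alpha*I)^-1 X'Y.
alpha = 10.0
d = X_train.shape[1]
W = np.linalg.solve(X_train.T @ X_train + alpha * np.eye(d),
                    X_train.T @ Y_train)

# Score each voxel by the Pearson correlation between predicted
# and held-out responses (the usual encoding-model metric).
Y_pred = X_test @ W

def column_corr(a, b):
    a = (a - a.mean(0)) / a.std(0)
    b = (b - b.mean(0)) / b.std(0)
    return (a * b).mean(0)

r = column_corr(Y_pred, Y_test)
print(f"median voxel correlation: {np.median(r):.3f}")
```

In practice the penalty `alpha` would be tuned per voxel (or per region) by cross-validation, and the features would be lagged to account for hemodynamic delay; both steps are omitted here for brevity.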
📝 Abstract
Predicting brain activity in response to naturalistic, multimodal stimuli is a key challenge in computational neuroscience. While encoding models are becoming more powerful, their ability to generalize to truly novel contexts remains a critical, often untested, question. In this work, we developed brain encoding models using state-of-the-art visual (X-CLIP) and auditory (Whisper) feature extractors and rigorously evaluated them on both in-distribution (ID) and diverse out-of-distribution (OOD) data. Our results reveal a fundamental trade-off between model complexity and generalization: a higher-capacity attention-based model excelled on ID data, but a simpler linear model was more robust, outperforming a competitive baseline by 18% on the OOD set. Intriguingly, we found that linguistic features did not improve predictive accuracy, suggesting that for familiar languages, neural encoding may be dominated by the continuous visual and auditory streams over redundant textual information. Spatially, our approach showed marked performance gains in the auditory cortex, underscoring the benefit of high-fidelity speech representations. Collectively, our findings demonstrate that rigorous OOD testing is essential for building robust neuro-AI models and provide nuanced insights into how model architecture, stimulus characteristics, and sensory hierarchies shape the neural encoding of our rich, multimodal world.