🤖 AI Summary
This study addresses the limitation of existing brain encoding models that rely on unimodal representations and struggle to effectively integrate visual, auditory, and linguistic information for predicting whole-brain responses to natural audiovisual stimuli. To overcome this, the authors propose the MIRAGE framework, which leverages a native multimodal backbone network equipped with an inter-layer adaptive gating mechanism, a Transformer-based brain encoder, and subject-specific linear readout heads informed by cortical parcellation. The approach achieves state-of-the-art performance in whole-brain fMRI response prediction, demonstrating that native multimodal representations outperform post-hoc fusion strategies. Furthermore, interpretable gating weights reveal distinct cortical activation patterns associated with each sensory modality, offering insights into how multimodal information is differentially processed across the brain.
📝 Abstract
Recent progress in task-optimized neural networks has established encoding models as a powerful tool for predicting brain responses to naturalistic stimuli, yet most existing approaches rely on unimodal representations. The emergence of omni-modal foundation models and rich multimodal neural datasets enables encoding models that jointly integrate visual, auditory, and linguistic information across subjects. We introduce MIRAGE, a brain encoding framework for predicting whole-brain fMRI responses to naturalistic audiovisual stimuli. MIRAGE achieves state-of-the-art performance via a native multimodal backbone and adaptive feature gating across layers. The gated representations are then passed to a transformer-based brain encoder and a subject-specific linear head over the cortical parcels. Controlled comparisons show that natively multimodal features consistently outperform post-hoc aggregation of independent unimodal features, across architectural levels and backbones. Beyond predictive accuracy, the learned attention weights can be inspected directly to reveal the modality-specific gating profile over the backbone, and each modality traces a distinct anatomical pattern across cortex. Together, these results support adaptive layer-wise aggregation of natively multimodal features as a generalizable, interpretable, and accurate approach for whole-brain encoding.
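To make the idea of layer-wise gating followed by a subject-specific linear head concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the softmax parameterization of the gate, the function names (`gate_layers`, `subject_readout`), the subject label, and all tensor sizes are hypothetical, and the transformer-based brain encoder that sits between the two steps is omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gate_layers(layer_feats, gate_logits):
    """Fuse backbone layers with learned, normalized weights.

    layer_feats: (n_layers, n_timepoints, dim) features, one slice per layer.
    gate_logits: (n_layers,) learnable scores (assumed softmax-normalized).
    Returns the fused features (n_timepoints, dim) and the weights (n_layers,).
    """
    weights = softmax(gate_logits)
    # Weighted sum over the layer axis.
    fused = np.tensordot(weights, layer_feats, axes=(0, 0))
    return fused, weights


def subject_readout(fused, W, b):
    """Subject-specific linear head: features -> one value per cortical parcel."""
    return fused @ W + b


# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
n_layers, n_timepoints, dim, n_parcels = 4, 6, 8, 10

layer_feats = rng.normal(size=(n_layers, n_timepoints, dim))
gate_logits = np.zeros(n_layers)  # uniform gate before any training

# One (W, b) pair per subject; "sub-01" is a hypothetical label.
heads = {"sub-01": (rng.normal(size=(dim, n_parcels)), np.zeros(n_parcels))}

fused, weights = gate_layers(layer_feats, gate_logits)
pred = subject_readout(fused, *heads["sub-01"])
```

In a sketch like this, `weights` is the quantity one would inspect after training to see which backbone layers a given modality relies on; keeping one logit vector per modality would give one such profile per modality.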