🤖 AI Summary
This study addresses the problem of cross-modal decoding from functional magnetic resonance imaging (fMRI) signals to video. Methodologically, it proposes the first biologically inspired ventral–dorsal dual-pathway framework that jointly models semantic ("What"), spatial ("Where"), and motion ("How") representations, mirroring canonical neuroanatomical principles. The architecture follows a decompose-then-fuse design: a multi-branch diffusion decoder aligns neural features via cross-modal projection and gated fusion, explicitly modeling motion dynamics in fMRI-to-video synthesis for the first time while establishing interpretable correspondences between the decoding branches and the ventral and dorsal streams. Experiments demonstrate state-of-the-art performance: 82.4% semantic classification accuracy, 70.6% spatial consistency, 0.212 cosine similarity for motion prediction, and 21.9% 50-way top-1 accuracy for video generation. Neural encoding analyses further corroborate the two-streams hypothesis.
📝 Abstract
Decoding visual experiences from brain activity is a significant challenge. Existing fMRI-to-video methods often focus on semantic content while overlooking spatial and motion information, yet all of these aspects are essential and are processed through distinct pathways in the brain. Motivated by this, we propose DecoFuse, a novel brain-inspired framework for decoding videos from fMRI signals. It first decomposes the video into three components (semantic, spatial, and motion), then decodes each component separately before fusing them to reconstruct the video. This approach not only simplifies the complex task of video decoding by breaking it into manageable sub-tasks, but also establishes a clearer connection between the learned representations and their biological counterparts, as supported by ablation studies. Further, our experiments show significant improvements over previous state-of-the-art methods, achieving 82.4% accuracy for semantic classification, 70.6% accuracy for spatial consistency, 0.212 cosine similarity for motion prediction, and 21.9% 50-way accuracy for video generation. Additionally, neural encoding analyses of semantic and spatial information align with the two-streams hypothesis, further validating the distinct roles of the ventral and dorsal pathways. Overall, DecoFuse provides a strong and biologically plausible framework for fMRI-to-video decoding. Project page: https://chongjg.github.io/DecoFuse/.
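The decompose-then-fuse idea above (separate semantic, spatial, and motion branches recombined by gated fusion) can be sketched in a few lines. This is a minimal illustrative sketch only, not the paper's actual implementation: the feature dimension, the single-layer gating network, and names such as `gated_fusion` are assumptions introduced here for clarity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(f_sem, f_spa, f_mot, W, b):
    """Fuse semantic ('What'), spatial ('Where'), and motion ('How') branch
    features with one learned gate per branch (illustrative sketch only)."""
    concat = np.concatenate([f_sem, f_spa, f_mot])  # shape (3d,)
    gates = sigmoid(W @ concat + b)                 # shape (3,), one gate in (0, 1) per branch
    fused = gates[0] * f_sem + gates[1] * f_spa + gates[2] * f_mot
    return fused, gates

# Toy example with random branch features and hypothetical gate weights
rng = np.random.default_rng(0)
d = 8                                               # assumed feature dimension
f_sem, f_spa, f_mot = (rng.standard_normal(d) for _ in range(3))
W = 0.1 * rng.standard_normal((3, 3 * d))
b = np.zeros(3)
fused, gates = gated_fusion(f_sem, f_spa, f_mot, W, b)
print(fused.shape, gates)
```

In the real system the fused representation would condition a video diffusion decoder; here the gates simply show how the three decoded components can be recombined with learned, input-dependent weights.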