🤖 AI Summary
This study addresses the cross-subject fMRI-based visual decoding challenge—reconstructing continuous naturalistic visual experiences without subject-specific training. Key bottlenecks include the absence of explicit modeling of the ventral–dorsal visual stream functional hierarchy and poor generalization of semantic representations across subjects. To overcome these, we propose VCFlow: the first decoding framework that explicitly embeds a hierarchical ventral–dorsal stream architecture within the model, jointly capturing early visual cortex responses and multi-stream high-level cognitive features via feature disentanglement and feature-level contrastive learning. Experiments demonstrate that VCFlow achieves near-state-of-the-art reconstruction fidelity (only 7% accuracy drop), reduces single-video decoding time to 10 seconds, and enables zero-shot cross-subject transfer without fine-tuning. These advances significantly enhance clinical deployability and cross-subject generalization robustness.
📝 Abstract
Subject-agnostic brain decoding, which aims to reconstruct continuous visual experiences from fMRI without subject-specific training, holds great potential for clinical applications. However, this direction remains underexplored due to challenges in cross-subject generalization and the complex nature of brain signals. In this work, we propose Visual Cortex Flow Architecture (VCFlow), a novel hierarchical decoding framework that explicitly models the ventral-dorsal architecture of the human visual system to learn multi-dimensional representations. By disentangling and leveraging features from early visual cortex, ventral, and dorsal streams, VCFlow captures diverse and complementary cognitive information essential for visual reconstruction. Furthermore, we introduce a feature-level contrastive learning strategy to enhance the extraction of subject-invariant semantic representations, thereby enhancing subject-agnostic applicability to previously unseen subjects. Unlike conventional pipelines that need more than 12 hours of per-subject data and heavy computation, VCFlow sacrifices only 7% accuracy on average yet generates each reconstructed video in 10 seconds without any retraining, offering a fast and clinically scalable solution. The source code will be released upon acceptance of the paper.