🤖 AI Summary
This study addresses distorted visual reconstructions when decoding brain activity evoked by natural scenes, a problem caused by heterogeneous low-level features and entangled high-level semantics in fMRI signals. We propose a hierarchical neural decoding framework that partitions the visual cortex into structure-oriented regions (V1–V3) and semantics-oriented regions (IT/LOC) and extracts a distinct fMRI representation from each to jointly drive the Versatile Diffusion model: V1–V3 features generate structural priors, while IT/LOC features, aligned with CLIP embeddings, impose semantic constraints, enabling end-to-end collaborative reconstruction. Our key contribution is the first explicit coupling of cortical functional parcellation with multimodal generative priors, which circumvents the semantic aliasing inherent in conventional single-path decoding. Experiments demonstrate significant improvements in natural image reconstruction: +1.8 dB PSNR (structural fidelity) and +12.3% CLIP-Score (semantic consistency), consistently outperforming state-of-the-art methods.
📝 Abstract
The reconstruction of visual information from brain activity fosters interdisciplinary integration between neuroscience and computer vision. However, existing methods still face challenges in accurately recovering highly complex visual stimuli. This difficulty stems from the characteristics of natural scenes: low-level features exhibit heterogeneity, while high-level features show semantic entanglement due to contextual overlaps. Inspired by the hierarchical representation theory of the visual cortex, we propose the HAVIR model, which separates the visual cortex into two hierarchical regions and extracts distinct features from each. Specifically, the Structural Generator extracts structural information from spatial processing voxels and converts it into latent diffusion priors, while the Semantic Extractor converts semantic processing voxels into CLIP embeddings. These components are integrated via the Versatile Diffusion model to synthesize the final image. Experimental results demonstrate that HAVIR enhances both the structural and semantic quality of reconstructions, even in complex scenes, and outperforms existing models.
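The two-pathway split described above can be illustrated with a minimal sketch. It is not the HAVIR implementation: ridge regression stands in for both the Structural Generator and the Semantic Extractor, the data are random, and every name and dimension (`split_by_roi`, `fit_ridge`, `latent_dim`, `clip_dim`, the ROI label layout) is a hypothetical choice made for illustration. The sketch only shows how voxels might be partitioned by region and mapped to two separate conditioning signals, which a diffusion model would then consume.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)


def split_by_roi(fmri, roi_labels, structural_rois, semantic_rois):
    """Split a (trials, voxels) array into structural and semantic voxel sets."""
    structural_mask = np.isin(roi_labels, structural_rois)
    semantic_mask = np.isin(roi_labels, semantic_rois)
    return fmri[:, structural_mask], fmri[:, semantic_mask]


# Toy data (assumption): random numbers in place of real fMRI responses and targets.
rng = np.random.default_rng(0)
n_trials, n_voxels = 64, 50
latent_dim, clip_dim = 16, 8  # hypothetical sizes, far smaller than real models

# Hypothetical per-voxel ROI labels: 10 voxels in each of five regions.
roi_names = np.array(["V1", "V2", "V3", "LOC", "IT"])
roi_labels = roi_names[np.arange(n_voxels) % len(roi_names)]

fmri = rng.standard_normal((n_trials, n_voxels))
target_latents = rng.standard_normal((n_trials, latent_dim))  # stand-in for image latents
target_clip = rng.standard_normal((n_trials, clip_dim))       # stand-in for CLIP embeddings

# Pathway split: early visual areas for structure, higher areas for semantics.
x_struct, x_sem = split_by_roi(
    fmri, roi_labels, structural_rois=["V1", "V2", "V3"], semantic_rois=["IT", "LOC"]
)

# One linear map per pathway (a stand-in for the learned modules).
w_struct = fit_ridge(x_struct, target_latents)
w_sem = fit_ridge(x_sem, target_clip)

structural_prior = x_struct @ w_struct   # would initialise the diffusion latent
semantic_embedding = x_sem @ w_sem       # would condition the diffusion model

print(structural_prior.shape, semantic_embedding.shape)
```

Keeping the two maps separate means each pathway is fitted only on the voxels assigned to it, which mirrors the motivation stated in the abstract: structural and semantic information are decoded from different regions rather than from one shared voxel set.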