HAVIR: HierArchical Vision to Image Reconstruction using CLIP-Guided Versatile Diffusion

📅 2025-10-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses visual reconstruction distortion in natural-scene brain activity decoding, caused by low-level feature heterogeneity and high-level semantic entanglement in fMRI signals. We propose a hierarchical neural decoding framework that partitions the visual cortex into structure-oriented regions (V1–V3) and semantics-oriented regions (IT/LOC), extracting distinct fMRI representations to jointly drive the Versatile Diffusion model: V1–V3 features generate structural priors, while IT/LOC features—aligned with CLIP embeddings—impose semantic constraints, enabling end-to-end collaborative reconstruction. Our key contribution is the first explicit coupling of cortical functional parcellation with multimodal generative priors, circumventing semantic aliasing inherent in conventional single-path decoding. Experiments demonstrate significant improvements in natural image reconstruction: +1.8 dB PSNR (structural fidelity) and +12.3% CLIP-Score (semantic consistency), consistently outperforming state-of-the-art methods.

📝 Abstract
The reconstruction of visual information from brain activity fosters interdisciplinary integration between neuroscience and computer vision. However, existing methods still face challenges in accurately recovering highly complex visual stimuli. This difficulty stems from the characteristics of natural scenes: low-level features exhibit heterogeneity, while high-level features show semantic entanglement due to contextual overlaps. Inspired by the hierarchical representation theory of the visual cortex, we propose the HAVIR model, which separates the visual cortex into two hierarchical regions and extracts distinct features from each. Specifically, the Structural Generator extracts structural information from spatial processing voxels and converts it into latent diffusion priors, while the Semantic Extractor converts semantic processing voxels into CLIP embeddings. These components are integrated via the Versatile Diffusion model to synthesize the final image. Experimental results demonstrate that HAVIR enhances both the structural and semantic quality of reconstructions, even in complex scenes, and outperforms existing models.
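The two-branch pipeline described in the abstract can be sketched in code. The sketch below is illustrative only: the voxel counts, the latent shape, and the use of plain linear maps are assumptions, not the paper's actual architecture, and the Versatile Diffusion stage that fuses the two outputs into an image is omitted. It shows only the shape contract: structure-oriented voxels (V1–V3) map to a latent diffusion prior, while semantics-oriented voxels (IT/LOC) map to a CLIP embedding.

```python
import math
import torch
import torch.nn as nn

# Hypothetical dimensions -- the paper does not report these exactly.
N_STRUCT_VOX = 5000          # voxels from structure-oriented regions (V1-V3)
N_SEM_VOX = 4000             # voxels from semantics-oriented regions (IT/LOC)
LATENT_SHAPE = (4, 64, 64)   # a typical latent-diffusion prior shape (assumed)
CLIP_DIM = 768               # a typical CLIP embedding width (assumed)


class StructuralGenerator(nn.Module):
    """Sketch of the Structural Generator: V1-V3 voxels -> latent prior."""

    def __init__(self) -> None:
        super().__init__()
        self.fc = nn.Linear(N_STRUCT_VOX, math.prod(LATENT_SHAPE))

    def forward(self, voxels: torch.Tensor) -> torch.Tensor:
        # Flattened prediction reshaped into the diffusion latent grid.
        return self.fc(voxels).view(-1, *LATENT_SHAPE)


class SemanticExtractor(nn.Module):
    """Sketch of the Semantic Extractor: IT/LOC voxels -> CLIP embedding."""

    def __init__(self) -> None:
        super().__init__()
        self.fc = nn.Linear(N_SEM_VOX, CLIP_DIM)

    def forward(self, voxels: torch.Tensor) -> torch.Tensor:
        return self.fc(voxels)


# Shape check on random stand-in fMRI data (batch of 2 trials).
structural_prior = StructuralGenerator()(torch.randn(2, N_STRUCT_VOX))
clip_embedding = SemanticExtractor()(torch.randn(2, N_SEM_VOX))
# In HAVIR, these two outputs would jointly condition Versatile Diffusion:
# the prior constrains layout, the embedding constrains semantics.
```

In the full model, both modules would be trained so the prior and embedding align with the latents and CLIP features of the viewed image; simple linear readouts stand in here for whatever architecture the paper actually uses.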
Problem

Research questions and friction points this paper is trying to address.

Reconstructing complex visual stimuli from brain activity
Overcoming heterogeneity in low-level visual features
Resolving semantic entanglement in high-level visual features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical separation of visual cortex regions
Semantic Extractor maps semantic-processing voxels to CLIP embeddings
Versatile Diffusion integrates structural and semantic components