HAVIR: HierArchical Vision to Image Reconstruction using CLIP-Guided Versatile Diffusion

πŸ“… 2025-06-06
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This study addresses structural distortion and semantic loss in fMRI-based decoding of complex visual stimuli. To reconstruct dense, multi-level spatial structure and rich semantic content from fMRI data, the authors propose a dual-adapter hierarchical framework: (1) an AutoKL adapter that maps fMRI signals to a latent diffusion prior, preserving topological and geometric structure; and (2) a CLIP adapter that maps them to joint text–image embeddings, capturing fine-grained semantics. The two representations jointly condition image synthesis within the Versatile Diffusion framework. To the authors' knowledge, this is the first approach to unify latent-space structural preservation with multimodal semantic guidance for fMRI decoding. Evaluated on complex natural scenes, the method improves both the structural fidelity and the semantic accuracy of reconstructed images, outperforming existing fMRI-to-image decoding models on quantitative and qualitative metrics.
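The summary above describes two voxel-to-representation mappings feeding one generator. The PyTorch sketch below shows one plausible shape for those adapters; the paper does not publish its code, so the MLP architecture, the pooled 768-d CLIP targets, and all names and dimensions here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
from math import prod

N_VOXELS = 15000                 # assumed size of the flattened fMRI input
LATENT_SHAPE = (4, 64, 64)       # AutoencoderKL latent for a 512x512 image
CLIP_DIM = 768                   # pooled CLIP embedding size (illustrative;
                                 # the model may use full token sequences)

class VoxelAdapter(nn.Module):
    """Illustrative adapter: a linear lift plus a small MLP projecting
    fMRI voxels into a fixed-shape target space."""
    def __init__(self, n_voxels, out_shape, hidden=4096):
        super().__init__()
        self.out_shape = out_shape
        self.net = nn.Sequential(
            nn.Linear(n_voxels, hidden),
            nn.LayerNorm(hidden),
            nn.GELU(),
            nn.Linear(hidden, prod(out_shape)),
        )

    def forward(self, voxels):                    # (B, n_voxels)
        return self.net(voxels).view(-1, *self.out_shape)

autokl_adapter = VoxelAdapter(N_VOXELS, LATENT_SHAPE)    # structural branch
clip_text_adapter = VoxelAdapter(N_VOXELS, (CLIP_DIM,))  # semantic: text
clip_image_adapter = VoxelAdapter(N_VOXELS, (CLIP_DIM,)) # semantic: image

voxels = torch.randn(2, N_VOXELS)         # dummy fMRI batch
z_struct = autokl_adapter(voxels)         # -> (2, 4, 64, 64) diffusion prior
e_text = clip_text_adapter(voxels)        # -> (2, 768)
e_image = clip_image_adapter(voxels)      # -> (2, 768)
```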

πŸ“ Abstract
Reconstructing visual information from brain activity bridges the gap between neuroscience and computer vision. Although progress has been made in decoding images from fMRI with generative models, it remains challenging to accurately recover highly complex visual stimuli. The difficulty stems from their elemental density and diversity, sophisticated spatial structures, and multifaceted semantic information. To address these challenges, we propose HAVIR, which contains two adapters: (1) the AutoKL Adapter transforms fMRI voxels into a latent diffusion prior, capturing topological structure; (2) the CLIP Adapter converts the voxels into CLIP text and image embeddings, which carry semantic information. These complementary representations are fused by Versatile Diffusion to generate the final reconstructed image. To extract the most essential semantic information from complex scenes, the CLIP Adapter is trained with text captions describing the visual stimuli and with corresponding semantic images synthesized from those captions. Experimental results demonstrate that HAVIR effectively reconstructs both the structural features and the semantic information of visual stimuli, even in complex scenarios, outperforming existing models.
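The abstract notes that the CLIP Adapter is trained against caption embeddings and against embeddings of semantic images synthesized from those captions. Below is a hedged sketch of such a training signal, reusing the adapter stubs from the sketch above; the combined MSE-plus-cosine objective is an assumption, not the paper's stated loss.

```python
import torch.nn.functional as F

def clip_adapter_loss(voxels, target_text_emb, target_image_emb):
    """Align predicted voxel embeddings with frozen-CLIP targets:
    text targets from the stimulus captions, image targets from
    semantic images synthesized from those captions."""
    def align(pred, target):
        # MSE constrains scale; the cosine term constrains direction.
        return F.mse_loss(pred, target) + \
               (1.0 - F.cosine_similarity(pred, target, dim=-1).mean())
    return (align(clip_text_adapter(voxels), target_text_emb) +
            align(clip_image_adapter(voxels), target_image_emb))
```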
Problem

Research questions and friction points this paper is trying to address.

Reconstructing complex visual stimuli from brain activity
Bridging neuroscience and computer vision with image decoding
Improving accuracy in recovering diverse semantic and spatial structures
Innovation

Methods, ideas, or system contributions that make the work stand out.

AutoKL Adapter transforms fMRI voxels into a latent diffusion prior that preserves structure
CLIP Adapter maps fMRI voxels to CLIP text and image embeddings that carry semantics
Versatile Diffusion fuses both representations to generate the reconstruction, as sketched below
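The fusion step can be pictured as an img2img-style denoising loop: the AutoKL latent seeds the trajectory so scene geometry survives, while the predicted CLIP embeddings steer each step through cross-attention. The sketch below uses diffusers-style calling conventions; `unet`, `scheduler`, and `vae` stand in for Versatile Diffusion components, and the strength value and two-token conditioning are illustrative assumptions rather than the paper's settings.

```python
import torch

@torch.no_grad()
def reconstruct(z_struct, e_text, e_image, unet, scheduler, vae,
                strength=0.75, num_steps=50):
    scheduler.set_timesteps(num_steps)
    # Start partway through the schedule so the structural latent's
    # layout survives the denoising (img2img-style partial noising).
    t_start = int(num_steps * strength)
    timesteps = scheduler.timesteps[-t_start:]
    latents = scheduler.add_noise(z_struct, torch.randn_like(z_struct),
                                  timesteps[:1])
    # Two-token conditioning sequence; the real model likely attends
    # over full CLIP token sequences rather than pooled vectors.
    cond = torch.cat([e_text.unsqueeze(1), e_image.unsqueeze(1)], dim=1)
    for t in timesteps:
        eps = unet(latents, t, encoder_hidden_states=cond).sample
        latents = scheduler.step(eps, t, latents).prev_sample
    return vae.decode(latents).sample   # latent -> reconstructed image
```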