🤖 AI Summary
This study addresses the challenging problem of cross-subject, fMRI-based reconstruction of complex natural-scene images, which is hindered by substantial inter-subject variability in neural responses and by the brain's highly abstract semantic encoding. To tackle these issues, we propose a lightweight dual-adapter diffusion framework: an AutoKL module models low-level spatial fMRI features, while a CLIP adapter aligns high-level semantic representations. Only 17% of the parameters (the fully connected layers) are fine-tuned for each new subject, enabling rapid cross-subject adaptation. The CLIP adapter is jointly trained on Stable Diffusion–generated images and COCO captions to emulate semantic encoding in the higher visual cortex, and is coupled end-to-end with the fMRI feature mapping and the diffusion-based image reconstruction. Our method achieves state-of-the-art performance while training in just one hour per subject on three RTX 4090 GPUs, demonstrating superior efficiency, cross-subject generalizability, and scalability compared to existing approaches.
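For intuition, here is a minimal PyTorch-style sketch of the dual-adapter mapping; the voxel count, latent shape, token and layer sizes, and module names are illustrative assumptions, not the paper's released implementation:

```python
import math

import torch
import torch.nn as nn

# Illustrative sketch only: all dimensions and names below are assumptions.

class AutoKLAdapter(nn.Module):
    """Maps a flattened fMRI voxel vector to a low-level AutoKL (VAE) latent."""
    def __init__(self, n_voxels: int, latent_shape=(4, 64, 64)):
        super().__init__()
        self.latent_shape = latent_shape
        self.fc = nn.Linear(n_voxels, math.prod(latent_shape))

    def forward(self, fmri: torch.Tensor) -> torch.Tensor:
        return self.fc(fmri).view(-1, *self.latent_shape)  # (B, 4, 64, 64)

class CLIPAdapter(nn.Module):
    """Maps the same voxels to CLIP-space tokens that condition the diffusion U-Net."""
    def __init__(self, n_voxels: int, n_tokens: int = 77, dim: int = 768):
        super().__init__()
        self.n_tokens, self.dim = n_tokens, dim
        self.proj = nn.Sequential(
            nn.Linear(n_voxels, 2048),
            nn.GELU(),
            nn.Linear(2048, n_tokens * dim),
        )

    def forward(self, fmri: torch.Tensor) -> torch.Tensor:
        return self.proj(fmri).view(-1, self.n_tokens, self.dim)

# Toy forward pass: a low-level latent for the VAE path and semantic tokens for the U-Net.
fmri = torch.randn(2, 15_000)            # hypothetical voxel count
low_level = AutoKLAdapter(15_000)(fmri)  # -> (2, 4, 64, 64)
semantics = CLIPAdapter(15_000)(fmri)    # -> (2, 77, 768)
```

In the full pipeline, the low-level latent would seed the diffusion decoder while the CLIP-space tokens stand in for text conditioning; both wiring details here are inferred from the description above, not confirmed by the paper.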
📝 Abstract
Reconstructing visual information from brain activity via computer vision techniques provides an intuitive understanding of visual neural mechanisms. Despite progress in decoding fMRI data with generative models, achieving accurate cross-subject reconstruction of visual stimuli remains challenging and computationally demanding. This difficulty arises from inter-subject variability in neural representations and from the brain's abstract encoding of core semantic features in complex visual inputs. To address these challenges, we propose NeuroSwift, which integrates two complementary adapters into a diffusion pipeline: an AutoKL adapter for low-level features and a CLIP adapter for semantics. NeuroSwift's CLIP adapter is trained on Stable Diffusion–generated images paired with COCO captions to emulate the encoding of the higher visual cortex. For cross-subject generalization, we pretrain on one subject and then fine-tune only 17% of the parameters (the fully connected layers) for each new subject, freezing all other components. This enables state-of-the-art performance, outperforming existing methods, with only one hour of training per subject on consumer-grade GPUs (three RTX 4090s).
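A minimal sketch of this fine-tuning recipe, assuming a PyTorch implementation in which the subject-specific mapping lives in fully connected (nn.Linear) layers; the helper function, the toy model, and the learning rate are hypothetical stand-ins:

```python
import torch
import torch.nn as nn

# Assumption: "fully connected layers" correspond to nn.Linear modules.

def freeze_all_but_fc(model: nn.Module) -> None:
    """Freeze every parameter, then re-enable gradients on nn.Linear layers only."""
    for p in model.parameters():
        p.requires_grad = False
    for m in model.modules():
        if isinstance(m, nn.Linear):
            for p in m.parameters():
                p.requires_grad = True

# Toy stand-in for the pretrained pipeline: a frozen backbone plus FC mappings.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),  # stays frozen (backbone stand-in)
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 512),      # fine-tuned per new subject
    nn.GELU(),
    nn.Linear(512, 128),             # fine-tuned per new subject
)
freeze_all_but_fc(model)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"fine-tuning {trainable / total:.0%} of parameters")

# Only the unfrozen FC parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```

The printed fraction for this toy model will not match the paper's 17%, which refers to the full NeuroSwift pipeline; the sketch only illustrates the freeze-everything-then-re-enable-FC pattern.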