Diffusion-Based Image-to-Brain Signal Generation with Cross-Attention Mechanisms for Visual Prostheses

📅 2025-08-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Visual neuroprosthetics face a fundamental bottleneck in the brain encoding stage: generated neural signals lack biological plausibility, largely because existing approaches have no real brain responses against which to supervise and validate the predicted stimuli. To address this, we propose a cross-attention-enhanced denoising diffusion framework in which a U-Net denoiser is conditioned on semantic visual features from a pretrained CLIP encoder, enabling fine-grained alignment between visual features and brain-signal representations during the denoising process. The framework is evaluated on THINGS-EEG2 and THINGS-MEG, and training and test M/EEG topographies are visualized for all subjects to show how the signals vary within and across subjects. The work targets the image-to-brain encoding stage of visual prostheses.

📝 Abstract

Visual prostheses have shown great potential in restoring vision for blind individuals. On the one hand, researchers have been continuously improving the brain decoding framework of visual prostheses by leveraging the powerful image generation capabilities of diffusion models. On the other hand, the brain encoding stage of visual prostheses struggles to generate brain signals with sufficient biological similarity. Although existing works have recognized this problem, the quality of predicted stimuli remains a critical issue, as existing approaches typically lack supervision signals from real brain responses to validate the biological plausibility of predicted stimuli. To address this issue, we propose a novel image-to-brain framework based on denoising diffusion probabilistic models (DDPMs) enhanced with cross-attention mechanisms. Our framework consists of two key architectural components: a pre-trained CLIP visual encoder that extracts rich semantic representations from input images, and a cross-attention-enhanced U-Net diffusion model that learns to reconstruct biologically plausible brain signals through iterative denoising. Unlike conventional generative models that rely on simple concatenation for conditioning, our cross-attention modules enable dynamic interaction between visual features and brain signal representations, facilitating fine-grained alignment during the generation process. We evaluate our framework on two multimodal datasets (THINGS-EEG2 and THINGS-MEG) to demonstrate its effectiveness in generating biologically plausible brain signals. Moreover, we visualize the training and test M/EEG topographies for all subjects on both datasets to intuitively demonstrate the intra-subject and inter-subject variations in M/EEG signals.
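
The contrast the abstract draws between concatenation and cross-attention conditioning can be illustrated with a minimal single-head sketch in NumPy. This is not the paper's implementation: the token counts, feature widths, and random projection matrices below are hypothetical placeholders, and a real model would learn the projections inside the U-Net. Brain-signal features act as queries and image features as keys and values, so each signal token draws on a different weighted mix of visual features.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(signal_tokens, image_tokens, w_q, w_k, w_v):
    """Single-head cross-attention.

    Queries come from the brain-signal features; keys and values come
    from the image features, so the image conditions the signal.
    """
    q = signal_tokens @ w_q                    # (n_signal, d_attn)
    k = image_tokens @ w_k                     # (n_image, d_attn)
    v = image_tokens @ w_v                     # (n_image, d_attn)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # scaled dot products
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights


# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
n_signal, d_signal = 17, 32   # brain-signal tokens and their width
n_image, d_image = 50, 64     # image tokens and their width
d_attn = 16                   # shared attention width

signal_tokens = rng.normal(size=(n_signal, d_signal))
image_tokens = rng.normal(size=(n_image, d_image))

# Random stand-ins for projections that would be learned in practice.
w_q = rng.normal(size=(d_signal, d_attn)) * 0.1
w_k = rng.normal(size=(d_image, d_attn)) * 0.1
w_v = rng.normal(size=(d_image, d_attn)) * 0.1

conditioned, weights = cross_attention(signal_tokens, image_tokens, w_q, w_k, w_v)
print(conditioned.shape, weights.shape)
```

With concatenation, every position in the signal representation receives the same image vector. Here the attention weights differ per signal token, which is the "dynamic interaction" the abstract refers to.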

Problem

Research questions and friction points this paper is trying to address.

Generating biologically plausible brain signals for visual prostheses

Improving brain encoding stage with cross-attention diffusion models

Validating biological similarity of predicted visual stimuli

Innovation

Methods, ideas, or system contributions that make the work stand out.

DDPMs with cross-attention mechanisms

CLIP encoder extracts semantic representations

Cross-attention enables dynamic feature interaction
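
As a rough illustration of the DDPM side of these contributions, the sketch below shows the standard forward noising step and the noise-prediction loss that a denoiser is usually trained with. It makes several assumptions that do not come from the paper: the linear beta schedule (with the values commonly used in the original DDPM formulation), the signal shape, and the zero-output placeholder standing in for the cross-attention U-Net.

```python
import numpy as np


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule (an assumption, not the paper's stated choice)."""
    return np.linspace(beta_start, beta_end, num_steps)


def q_sample(x0, t, alpha_bar, noise):
    """Forward process: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise


def noise_prediction_loss(predicted_noise, true_noise):
    """Mean squared error between predicted and true noise."""
    return float(np.mean((predicted_noise - true_noise) ** 2))


rng = np.random.default_rng(0)
num_steps = 1000
betas = linear_beta_schedule(num_steps)
alpha_bar = np.cumprod(1.0 - betas)    # cumulative product of (1 - beta_t)

# Hypothetical brain-signal array (channels x time points), for illustration only.
x0 = rng.normal(size=(8, 100))
noise = rng.normal(size=x0.shape)
t = 500

x_t = q_sample(x0, t, alpha_bar, noise)

# Placeholder denoiser that always predicts zero noise. In the framework
# described above, this would be the image-conditioned U-Net.
predicted_noise = np.zeros_like(noise)
loss = noise_prediction_loss(predicted_noise, noise)
print(x_t.shape, loss)
```

Generation then runs the process in reverse: starting from pure noise, the trained denoiser removes a little predicted noise at each step, conditioned on the image features.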

🔎 Similar Papers

Animate Your Thoughts: Decoupled Reconstruction of Dynamic Natural Vision from Slow Brain Activity

2024-05-06 | arXiv.org | Citations: 2
