🤖 AI Summary
Visual neuroprosthetics face a fundamental bottleneck: the brain signals generated at the encoding stage are not biologically plausible enough, largely because existing approaches lack ground-truth brain responses with which to supervise and evaluate their predicted stimuli. To address this, we propose a cross-attention-enhanced diffusion framework that conditions the generation of EEG and MEG signals on semantic visual features from a pretrained CLIP model, enabling fine-grained cross-modal alignment during the denoising process. Rather than conditioning by simple concatenation, the cross-attention modules let visual features and brain signal representations interact dynamically, which is intended to improve the neural consistency of the synthesized signals. We evaluate the framework on THINGS-EEG2 and THINGS-MEG and visualize M/EEG topographies for all subjects to show intra-subject and inter-subject variation. This work targets the brain encoding stage of visual prostheses and related closed-loop neuroprosthetic systems.
📝 Abstract
Visual prostheses have shown great potential for restoring vision in blind individuals. On the one hand, researchers have steadily improved the brain decoding stage of visual prostheses by leveraging the powerful image generation capabilities of diffusion models. On the other hand, the brain encoding stage still struggles to generate brain signals with sufficient biological similarity. Although existing works have recognized this problem, the quality of predicted stimuli remains a critical issue, because existing approaches typically lack supervision signals from real brain responses with which to validate the biological plausibility of those stimuli. To address this issue, we propose a novel image-to-brain framework based on denoising diffusion probabilistic models (DDPMs) enhanced with cross-attention mechanisms. Our framework consists of two key architectural components: a pre-trained CLIP visual encoder that extracts rich semantic representations from input images, and a cross-attention-enhanced U-Net diffusion model that learns to reconstruct biologically plausible brain signals through iterative denoising. Unlike conventional generative models that rely on simple concatenation for conditioning, our cross-attention modules enable dynamic interaction between visual features and brain signal representations, facilitating fine-grained alignment during the generation process. We evaluate our framework on two datasets covering different recording modalities (THINGS-EEG2 and THINGS-MEG) to demonstrate its effectiveness in generating biologically plausible brain signals. Moreover, we visualize the training and test M/EEG topographies for all subjects on both datasets to illustrate the intra-subject and inter-subject variations in M/EEG signals.
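To make the conditioning idea concrete, here is a minimal NumPy sketch of a single cross-attention step in which noised brain-signal tokens act as queries over image-feature tokens. It is not the authors' implementation: the token counts, feature dimensions, random projection matrices, and function names are all hypothetical, and the random image tokens merely stand in for CLIP features. The noising function uses the standard DDPM forward process, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def ddpm_forward_noise(x0, alpha_bar_t, rng):
    # Standard DDPM forward process: mix the clean signal with Gaussian noise.
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return x_t, eps

def cross_attention(brain_tokens, image_tokens, w_q, w_k, w_v):
    # Queries come from the brain-signal stream; keys and values come
    # from the image features, so each brain token reads from the image.
    q = brain_tokens @ w_q                      # (n_brain, d)
    k = image_tokens @ w_k                      # (n_img, d)
    v = image_tokens @ w_v                      # (n_img, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])     # (n_brain, n_img)
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
n_brain, n_img, d_brain, d_img, d = 17, 50, 32, 64, 16

brain_tokens = rng.standard_normal((n_brain, d_brain))  # stand-in for M/EEG features
image_tokens = rng.standard_normal((n_img, d_img))      # stand-in for CLIP features

# Random projections standing in for learned weights.
w_q = rng.standard_normal((d_brain, d)) / np.sqrt(d_brain)
w_k = rng.standard_normal((d_img, d)) / np.sqrt(d_img)
w_v = rng.standard_normal((d_img, d)) / np.sqrt(d_img)

noisy, eps = ddpm_forward_noise(brain_tokens, alpha_bar_t=0.5, rng=rng)
attended, weights = cross_attention(noisy, image_tokens, w_q, w_k, w_v)
```

With concatenation, every brain token would see the same fixed image vector. Here each row of `weights` is a separate distribution over image tokens, which is one way to read the "dynamic interaction" the abstract describes.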