ViBE: Visual-to-M/EEG Brain Encoding via Spatio-Temporal VAE and Distribution-Aligned Projection

📅 2026-04-28
📈 Citations: 0
Influential: 0

🤖 AI Summary
This study addresses the generation of electroencephalography (EEG) and magnetoencephalography (MEG) signals from visual stimuli, together with cross-modal alignment between visual and neural representations. The authors present ViBE as a novel framework that uses a spatio-temporal convolutional variational autoencoder (TSC-VAE) to reconstruct M/EEG signals and a Q-Former module to map CLIP image embeddings into the TSC-VAE latent space. Alignment is optimized jointly at the feature level, using mean squared error (MSE), and at the distribution level, using sliced Wasserstein distance (SWD). Experiments on the THINGS-EEG2 and THINGS-MEG datasets are reported to demonstrate the approach's effectiveness in generating M/EEG signals from visual stimuli.
📝 Abstract
Brain encoding models not only serve to decipher how visual stimuli are transformed into neural responses, but also represent a critical step toward visual prostheses that restore vision for patients with severe vision disorders. Brain encoding involves two fundamental steps: achieving faithful reconstruction of neural responses and establishing cross-modal alignment between visual stimuli and neural responses. To this end, we propose ViBE, a novel brain encoding framework for generating magnetoencephalography (MEG) and electroencephalography (EEG) signals from visual stimuli. Specifically, we first design a spatio-temporal convolutional variational autoencoder (TSC-VAE) that captures the spatio-temporal characteristics of M/EEG signals for effective neural response reconstruction. To bridge the modality gap between visual features and neural representations, we employ Q-Former to map CLIP image embeddings to the TSC-VAE latent space, producing neural proxy embeddings. For comprehensive cross-modal alignment, we combine mean squared error (MSE) loss for point-wise feature matching with sliced Wasserstein distance (SWD) for probability distribution alignment between the neural proxy embeddings and TSC-VAE latent embeddings. We conduct extensive experiments on the THINGS-EEG2 and THINGS-MEG datasets, demonstrating the effectiveness of our approach in generating high-quality M/EEG signals from visual stimuli.
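To make the alignment objective described in the abstract more concrete, the following is a minimal sketch of combining a point-wise MSE term with a Monte Carlo estimate of the sliced Wasserstein distance between two batches of embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `swd_weight` and `n_projections` parameters, the use of the squared 2-Wasserstein form, and the choice of NumPy rather than a deep learning framework are all hypothetical, and the abstract does not specify how the two terms are weighted.

```python
import numpy as np


def sliced_wasserstein_distance(x, y, n_projections=64, rng=None):
    """Estimate the squared sliced 2-Wasserstein distance between two sample sets.

    x, y: arrays of shape (n_samples, dim) with the same shape (an assumption
    of this sketch; it keeps the 1-D optimal transport step a simple sort).
    """
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equally sized batches")
    rng = np.random.default_rng(rng)

    # Draw random directions on the unit sphere, one per column.
    directions = rng.standard_normal((x.shape[1], n_projections))
    directions /= np.linalg.norm(directions, axis=0, keepdims=True)

    # Project both sets onto each direction and sort along the sample axis.
    # In 1-D, optimal transport between equal-sized sets pairs sorted samples.
    x_proj = np.sort(x @ directions, axis=0)
    y_proj = np.sort(y @ directions, axis=0)

    # Average squared difference over samples and projections.
    return float(np.mean((x_proj - y_proj) ** 2))


def alignment_loss(proxy, latent, swd_weight=1.0, n_projections=64, rng=None):
    """Point-wise MSE plus a weighted SWD term (the weighting is an assumption)."""
    mse = float(np.mean((proxy - latent) ** 2))
    swd = sliced_wasserstein_distance(proxy, latent, n_projections, rng)
    return mse + swd_weight * swd


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latent = rng.standard_normal((128, 16))     # stand-in for VAE latent embeddings
    proxy = latent[rng.permutation(128)]        # same samples, shuffled order
    print("MSE:", float(np.mean((proxy - latent) ** 2)))
    print("SWD:", sliced_wasserstein_distance(proxy, latent, rng=0))
```

The shuffled batch in the usage example shows why the two terms are complementary: the MSE term is sensitive to which proxy embedding is paired with which latent embedding, while the SWD term compares only the two empirical distributions and is therefore unaffected by the ordering of samples within the batch.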
Problem

Research questions and friction points this paper is trying to address.

brain encoding
visual stimuli
MEG
EEG
cross-modal alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatio-temporal VAE
distribution-aligned projection
cross-modal alignment
neural response reconstruction
Q-Former