ViBE: Visual-to-M/EEG Brain Encoding via Spatio-Temporal VAE and Distribution-Aligned Projection

📅 2026-04-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses high-fidelity generation of electroencephalography (EEG) and magnetoencephalography (MEG) signals from visual stimuli, together with cross-modal alignment between visual and neural representations. The authors propose the ViBE framework, which uses a spatio-temporal convolutional variational autoencoder (TSC-VAE) to reconstruct M/EEG signals and a Q-Former module to map CLIP image embeddings into the TSC-VAE latent space. Alignment is optimized jointly at two levels: mean squared error (MSE) for point-wise feature matching and sliced Wasserstein distance (SWD) for distribution matching. The approach is presented as the first to combine a spatio-temporal VAE with a distribution alignment strategy, and experiments on the THINGS-EEG2 and THINGS-MEG datasets are reported to show improved neural response reconstruction, narrowing the gap between visual and neural modalities.

📝 Abstract

Brain encoding models not only serve to decipher how visual stimuli are transformed into neural responses, but also represent a critical step toward visual prostheses that restore vision for patients with severe vision disorders. Brain encoding involves two fundamental steps: achieving faithful reconstruction of neural responses and establishing cross-modal alignment between visual stimuli and neural responses. To this end, we propose ViBE, a novel brain encoding framework for generating magnetoencephalography (MEG) and electroencephalography (EEG) signals from visual stimuli. Specifically, we first design a spatio-temporal convolutional variational autoencoder (TSC-VAE) that captures the spatio-temporal characteristics of M/EEG signals for effective neural response reconstruction. To bridge the modality gap between visual features and neural representations, we employ Q-Former to map CLIP image embeddings to the TSC-VAE latent space, producing neural proxy embeddings. For comprehensive cross-modal alignment, we combine mean squared error (MSE) loss for point-wise feature matching with sliced Wasserstein distance (SWD) for probability distribution alignment between the neural proxy embeddings and TSC-VAE latent embeddings. We conduct extensive experiments on the THINGS-EEG2 and THINGS-MEG datasets, demonstrating the effectiveness of our approach in generating high-quality M/EEG signals from visual stimuli.
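The alignment objective described in the abstract combines a point-wise MSE term with a sliced Wasserstein term. The sketch below illustrates that combination in NumPy and is not the authors' implementation. It assumes both sets of embeddings are flattened to arrays of shape (n, d) with equal batch size. The function names, the `swd_weight` parameter, the number of random projections, and the use of the squared 2-Wasserstein cost are all assumptions made for illustration.

```python
import numpy as np


def sliced_wasserstein_distance(x, y, n_projections=64, seed=0):
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance.

    x, y: arrays of shape (n, d) holding two equally sized sample sets.
    """
    if x.shape != y.shape:
        raise ValueError("x and y must have the same shape")
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # Draw random directions and normalize them to unit length.
    directions = rng.normal(size=(d, n_projections))
    directions /= np.linalg.norm(directions, axis=0, keepdims=True)
    # Project both sample sets onto each direction, then sort each
    # 1-D projection; sorted samples give the optimal 1-D coupling.
    x_proj = np.sort(x @ directions, axis=0)
    y_proj = np.sort(y @ directions, axis=0)
    return float(np.mean((x_proj - y_proj) ** 2))


def alignment_loss(proxy, latent, swd_weight=1.0, n_projections=64, seed=0):
    """Hypothetical combined loss: point-wise MSE plus weighted SWD."""
    mse = float(np.mean((proxy - latent) ** 2))
    swd = sliced_wasserstein_distance(proxy, latent, n_projections, seed)
    return mse + swd_weight * swd


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    latent = rng.normal(size=(128, 16))                    # stand-in for VAE latents
    proxy = latent + 0.1 * rng.normal(size=latent.shape)   # stand-in for proxy embeddings
    print(alignment_loss(proxy, latent))
```

The two terms are complementary: MSE penalizes each proxy embedding for deviating from its paired latent, while the sliced Wasserstein term ignores pairing and compares the two batches as distributions. Shuffling the rows of one batch therefore changes the MSE but leaves the SWD term essentially unchanged.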

Problem

Research questions and friction points this paper is trying to address.

brain encoding

visual stimuli

MEG

EEG

cross-modal alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

spatio-temporal VAE

distribution-aligned projection

cross-modal alignment

neural response reconstruction

Q-Former

🔎 Similar Papers

Time-Dependent VAE for Building Latent Representations from Visual Neural Activity with Complex Dynamics

2024-08-15 · Citations: 0

Brain-aligning of semantic vectors improves neural decoding of visual stimuli

2024-03-22 · Citations: 0