MindAlign: Bridging EEG, Vision, and Language for Zero-Shot Visual Decoding

📅 2026-05-23
🤖 AI Summary
This study addresses high-accuracy zero-shot visual decoding from non-invasive electroencephalography (EEG) signals, aiming to bridge the gap between neural activity and visual semantics. It proposes a tri-modal contrastive learning framework in which language supervision acts as a semantic regularizer. Training proceeds in two stages: the EEG encoder is first pre-trained on unlabeled trials via masked reconstruction, and then EEG, images, and large language model (LLM)-generated textual descriptions are jointly aligned in a unified semantic space. The encoder combines subject-specific adaptation, graph attention over channels, and spatio-temporal convolutional embeddings, and the analysis finds that compact CN-CLIP embeddings outperform much larger backbones. On the Things-EEG2 200-way zero-shot benchmark, the method achieves 54.1% Top-1 and 83.4% Top-5 accuracy, compared with 32.4% / 64.0% for the strongest prior baseline, and generalization is additionally validated on the Things-MEG dataset.
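To make the reported Top-1 and Top-5 figures concrete, the sketch below shows how 200-way zero-shot retrieval accuracy is commonly computed: each EEG embedding is compared by cosine similarity against one embedding per candidate class, and a trial counts as a hit if its true class is among the k most similar. This is an illustrative sketch, not the paper's evaluation code; the function names, the embedding dimension, and the synthetic data are assumptions.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)

def topk_accuracy(eeg_emb, class_emb, labels, k=1):
    """Fraction of trials whose true class is among the k most similar classes.

    eeg_emb:   (n_trials, d) EEG embeddings
    class_emb: (n_classes, d) one target embedding per candidate class
    labels:    (n_trials,) index of the true class for each trial
    """
    sims = l2_normalize(eeg_emb) @ l2_normalize(class_emb).T
    topk = np.argsort(-sims, axis=1)[:, :k]          # k best classes per trial
    hits = (topk == labels[:, None]).any(axis=1)
    return float(hits.mean())

# Synthetic stand-in data (assumption): 200 classes, 64-dimensional embeddings.
rng = np.random.default_rng(0)
n_classes, dim = 200, 64
class_emb = rng.normal(size=(n_classes, dim))
labels = np.arange(n_classes)
eeg_emb = class_emb + 0.5 * rng.normal(size=(n_classes, dim))  # noisy copies

top1 = topk_accuracy(eeg_emb, class_emb, labels, k=1)
top5 = topk_accuracy(eeg_emb, class_emb, labels, k=5)
print(f"Top-1: {top1:.3f}  Top-5: {top5:.3f}")
```

With 200 candidate classes, chance level is 0.5% for Top-1 and 2.5% for Top-5, which puts the reported accuracies in context.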
📝 Abstract
Visual decoding from brain signals is a key challenge at the intersection of computer vision and neuroscience, requiring methods that bridge neural representations and computational models of vision. We introduce a tri-modal contrastive framework for EEG-based visual decoding that aligns EEG, visual, and textual representations within a unified latent space. Our approach follows a two-stage design. First, we pre-train an EEG encoder via masked reconstruction on unlabeled trials, learning spatio-temporal regularities that transfer robustly to downstream tasks. Second, we jointly align EEG, image, and LLM-generated textual descriptions through contrastive learning, where text supervision acts as a semantic regularizer that injects linguistic structure into the shared space without overwhelming the primary EEG-image signal. The encoder integrates subject-specific adaptation, graph-attention over channels, and temporal-spatial convolutional embeddings. On the Things-EEG2 200-way zero-shot benchmark, our framework achieves 54.1% Top-1 and 83.4% Top-5 accuracy, substantially exceeding the strongest prior baseline (32.4% / 64.0%), with paired Wilcoxon tests confirming significance (p < 0.01) over all in-subject baselines. We validate generalization on Things-MEG. Analysis reveals that compact embedding geometries (CN-CLIP) outperform much larger backbones, and that decoding aligns with established neurophysiology of visual processing. This work is a critical step towards robust, semantically-grounded visual decoding from non-invasive temporal neural signals. The source code is publicly available at https://github.com/anon-eeg/eeg_image_decoding.
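The second stage described above can be illustrated with a minimal sketch of a tri-modal contrastive objective: a symmetric InfoNCE loss between EEG and image embeddings, plus a down-weighted EEG-text term that plays the role of the semantic regularizer. The abstract does not give the exact loss, so the symmetric InfoNCE form, the `text_weight` and `temperature` values, and all function names here are assumptions rather than the authors' formulation.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE: row i of `a` and row i of `b` form the positive pair."""
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # a -> b retrieval
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # b -> a retrieval
    return float(0.5 * (loss_ab + loss_ba))

def trimodal_loss(eeg, image, text, text_weight=0.1, temperature=0.07):
    """EEG-image alignment plus a down-weighted EEG-text regularizer.

    text_weight is a hypothetical hyperparameter: keeping it small lets text
    shape the shared space without dominating the EEG-image term.
    """
    main = symmetric_info_nce(eeg, image, temperature)
    reg = symmetric_info_nce(eeg, text, temperature)
    return main + text_weight * reg

# Synthetic batch (assumption): 8 trials, 16-dimensional embeddings.
rng = np.random.default_rng(0)
image = rng.normal(size=(8, 16))
text = image + 0.1 * rng.normal(size=(8, 16))
eeg = image + 0.1 * rng.normal(size=(8, 16))

aligned = trimodal_loss(eeg, image, text)
mismatched = trimodal_loss(eeg, np.roll(image, 1, axis=0), np.roll(text, 1, axis=0))
print(f"aligned: {aligned:.4f}  mismatched: {mismatched:.4f}")
```

In a real training loop the embeddings would come from trainable encoders and the loss would be minimized with an automatic-differentiation framework; the NumPy version only shows how the two terms are combined.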
Problem

Research questions and friction points this paper is trying to address.

visual decoding
EEG
zero-shot
brain-computer interface
neural representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

EEG-based visual decoding
tri-modal contrastive learning
zero-shot inference
neural-text alignment
spatio-temporal EEG encoding