🤖 AI Summary
Existing zero-shot EEG-to-image retrieval methods generalize poorly across subjects because they neglect individual differences in the multi-granularity representations of EEG signals. This work proposes SAMGA, a framework that combines a subject-aware multi-granularity visual supervision objective with a coarse-to-fine cross-modal alignment strategy. By adaptively aggregating intermediate-layer features from a pretrained visual encoder, SAMGA improves both the stability of the shared semantic geometry and instance-level discriminability within a shared encoder, balancing subject-specific neural response characteristics against cross-subject generalizability. On the THINGS-EEG benchmark, the method achieves intra-subject Top-1 and Top-5 retrieval accuracies of 91.3% and 98.8%, respectively, and cross-subject accuracies of 34.4% and 64.8%, outperforming recent state-of-the-art methods.
📝 Abstract
Zero-shot EEG-to-image retrieval aims to decode perceived visual content from electroencephalography (EEG) by aligning neural responses with pretrained visual representations, providing a promising route toward scalable visual neural decoding and practical brain-computer interfaces. However, robust EEG-to-image retrieval remains challenging because prior methods usually rely on either a single fixed visual target or a subject-invariant target construction scheme. Such designs overlook two important properties of visually evoked EEG signals: they preserve information across multiple representational scales, and the visual granularity best matched to EEG may vary across subjects. To address these issues, a subject-aware multi-granularity alignment (SAMGA) framework is proposed for zero-shot EEG-to-image retrieval. SAMGA first constructs a subject-aware visual supervision target by adaptively aggregating multiple intermediate representations from a pretrained vision encoder, allowing the model to absorb subject-dependent granularity deviations during training while preserving subject-agnostic inference. Building on this adaptive target construction, a coarse-to-fine cross-modal alignment strategy is further designed with a shared encoder, in which the coarse stage stabilizes the shared semantic geometry and reduces subject-induced distribution shift, and the fine stage further improves instance-level retrieval discrimination. Extensive experiments on the THINGS-EEG benchmark demonstrate that the proposed method achieves 91.3% Top-1 and 98.8% Top-5 accuracy in the intra-subject setting, and 34.4% Top-1 and 64.8% Top-5 accuracy in the inter-subject setting, outperforming recent state-of-the-art methods.
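To make the idea of a subject-aware, multi-granularity target more concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not specify the aggregation rule or the loss, so two assumptions are made here: the target is a softmax-weighted sum of intermediate-layer features with one learnable weight vector per subject, and alignment uses a symmetric InfoNCE-style contrastive loss. All function names, shapes, and the temperature value are hypothetical, and the coarse-to-fine schedule is not modeled.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def aggregate_targets(layer_feats, subject_logits, subject_ids):
    """Build one visual target per sample (assumed aggregation rule).

    layer_feats:    (B, L, D) features from L intermediate encoder layers
    subject_logits: (S, L) per-subject layer logits (learnable in practice)
    subject_ids:    (B,) integer subject index of each sample
    """
    # Softmax over layers gives each subject a convex mix of granularities.
    weights = np.exp(log_softmax(subject_logits, axis=-1))[subject_ids]  # (B, L)
    target = np.einsum("bl,bld->bd", weights, layer_feats)               # (B, D)
    return l2_normalize(target)

def symmetric_info_nce(eeg_emb, img_target, temperature=0.07):
    """Symmetric contrastive loss; matched pairs lie on the diagonal."""
    logits = l2_normalize(eeg_emb) @ img_target.T / temperature  # (B, B)
    idx = np.arange(logits.shape[0])
    eeg_to_img = log_softmax(logits, axis=1)[idx, idx].mean()
    img_to_eeg = log_softmax(logits, axis=0)[idx, idx].mean()
    return -0.5 * (eeg_to_img + img_to_eeg)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(4, 3, 8))   # 4 images, 3 layers, 8 dims
    logits = rng.normal(size=(2, 3))     # 2 subjects
    ids = np.array([0, 1, 0, 1])
    targets = aggregate_targets(feats, logits, ids)
    eeg = rng.normal(size=(4, 8))        # stand-in for EEG encoder output
    print(symmetric_info_nce(eeg, targets))
```

In this sketch the per-subject weights only shape the training target. Retrieval at test time would compare EEG embeddings from the shared encoder against image features without any subject-specific input, which is one plausible reading of "subject-agnostic inference" in the abstract.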