🤖 AI Summary
This work addresses the fidelity loss and representational shift inherent in reconstructing visual stimuli from EEG/MEG signals by proposing a collaborative training framework that integrates multimodal priors—specifically image, text, depth, and edge cues. The approach combines a streamlined alignment module with a pretrained diffusion model and introduces an uncertainty-weighted similarity scoring mechanism to quantify the fidelity of each modality. Furthermore, a fusion encoder is designed to integrate shared representations across modalities, enabling more precise cross-modal alignment. Evaluated on the THINGS-EEG dataset, the method achieves substantial improvements over the state-of-the-art CognitionCapturer, with Top-1 and Top-5 retrieval accuracy gains of 25.9% and 10.6%, respectively.
📝 Abstract
Reconstructing visual stimuli from EEG remains challenging due to fidelity loss and representation shift. We propose CognitionCapturerPro, an enhanced framework that integrates EEG with multimodal priors (images, text, depth, and edges) via collaborative training. Our core contributions are an uncertainty-weighted similarity scoring mechanism that quantifies modality-specific fidelity and a fusion encoder that integrates shared representations across modalities. Using a simplified alignment module and a pre-trained diffusion model, our method significantly outperforms the original CognitionCapturer on the THINGS-EEG dataset, improving Top-1 and Top-5 retrieval accuracy by 25.9% and 10.6%, respectively. Code is available at: https://github.com/XiaoZhangYES/CognitionCapturerPro.
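The uncertainty-weighted similarity scoring idea can be sketched roughly as follows: compute a similarity between the EEG embedding and each modality's embedding, then down-weight modalities whose learned uncertainty is high. This is a minimal illustrative sketch, not the paper's exact formulation; the use of cosine similarity and a softmax over negative log-variances (in the spirit of standard uncertainty weighting) are assumptions, and `uncertainty_weighted_score` is a hypothetical helper name.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def uncertainty_weighted_score(eeg_emb, modality_embs, log_vars):
    """Fuse per-modality similarities into one score.

    eeg_emb       : 1-D EEG embedding
    modality_embs : list of 1-D embeddings (e.g. image, text, depth, edge)
    log_vars      : learned log-variance per modality; higher means the
                    modality's alignment is treated as less reliable

    NOTE: the weighting scheme here (normalized exp(-log_var)) is an
    assumption for illustration, not the paper's published mechanism.
    """
    sims = np.array([cosine_sim(eeg_emb, m) for m in modality_embs])
    weights = np.exp(-np.asarray(log_vars, dtype=float))
    weights /= weights.sum()          # normalize so weights sum to 1
    return float(np.dot(weights, sims)), weights
```

In a retrieval setting, this fused score would replace a single-modality similarity when ranking candidate images, letting low-fidelity modalities (say, a noisy depth cue) contribute less to the final ranking.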