🤖 AI Summary
Existing methods treat CLIP as a static feature extractor, overlooking both its adaptability to neural representations and the neuro-symbolic gap in EEG–image alignment. To address this, we propose a neuroscience-inspired multimodal contrastive learning framework featuring dual-stream visual embedding, global visual prompt token injection, and a novel contrastive loss grounded in human visual encoding mechanisms, enabling joint global- and instance-level prompt optimization for the first time. The method integrates dynamic bandpass filtering, token-level cross-modal fusion, and in-Transformer prompt tuning, substantially enhancing EEG-to-image semantic alignment. On the THINGS-EEG2 dataset, our approach achieves 63.2% Top-1 accuracy in zero-shot image retrieval, surpassing the previous state of the art by 12.3%, and gains +4.6% Top-1 under cross-subject conditions. This work establishes an interpretable, generalizable cross-modal alignment paradigm for brain–computer interfaces and neural semantic decoding.
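As a rough illustration of the global prompt token injection summarized above, the PyTorch sketch below wraps a frozen Transformer block and prepends learnable prompt tokens to its input sequence, in the style of visual prompt tuning. The wrapper name, prompt count, and initialization scale are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class PromptedBlock(nn.Module):
    """Wrap a frozen Transformer block and prepend learnable global
    prompt tokens to the token sequence on every forward pass.
    (Hypothetical sketch; names and hyperparameters are assumptions.)"""

    def __init__(self, block: nn.Module, num_prompts: int = 8, dim: int = 768):
        super().__init__()
        self.block = block
        for p in self.block.parameters():  # keep the CLIP backbone frozen
            p.requires_grad = False
        # Modality-level prompts, shared across all images.
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, dim) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1 + num_patches, dim); position 0 is the [CLS] token.
        b, n = x.shape[0], self.prompts.shape[1]
        prompts = self.prompts.expand(b, -1, -1)
        x = torch.cat([x[:, :1], prompts, x[:, 1:]], dim=1)  # inject after [CLS]
        x = self.block(x)
        # Strip the prompt positions so later layers see the original layout.
        return torch.cat([x[:, :1], x[:, 1 + n:]], dim=1)
```

Wrapping each block of CLIP's vision encoder this way leaves the backbone weights untouched while only the prompt tokens receive gradients, which matches the general pattern of in-Transformer prompt tuning.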
📝 Abstract
Recent advances in brain-inspired artificial intelligence have sought to align neural signals with visual semantics using multimodal models such as CLIP. However, existing methods often treat CLIP as a static feature extractor, overlooking its adaptability to neural representations and the inherent physiological–symbolic gap in EEG–image alignment. To address these challenges, we present NeuroCLIP, a prompt-tuning framework tailored for EEG-to-image contrastive learning. Our approach introduces three core innovations: (1) a dual-stream visual embedding pipeline that combines dynamic filtering and token-level fusion to generate instance-level adaptive prompts, which adjust the patch embedding tokens according to image content and thereby enable fine-grained modulation of visual representations under neural constraints; (2) the first use of visual prompt tokens in EEG–image alignment: inserted into the Transformer architecture, they act as global, modality-level prompts that complement the instance-level adjustments and support neural-aware adaptation and parameter optimization at a global level; (3) a refined contrastive loss, inspired by neuroscientific principles of human visual encoding, that better models the semantic ambiguity and cross-modal noise present in EEG signals. On the THINGS-EEG2 dataset, NeuroCLIP achieves a Top-1 accuracy of 63.2% in zero-shot image retrieval, surpassing the previous best method by +12.3%, and demonstrates strong generalization under inter-subject conditions (+4.6% Top-1), highlighting the potential of physiology-aware prompt tuning for bridging brain signals and visual semantics.
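The abstract describes the refined loss only at a high level. For reference, the sketch below shows the symmetric EEG–image InfoNCE objective such a loss would build on, with label smoothing as one illustrative way to soften the hard one-to-one targets that semantic ambiguity and EEG noise violate. The function name and hyperparameter values are assumptions; the paper's actual formulation presumably refines this baseline further.

```python
import torch
import torch.nn.functional as F

def soft_infonce(eeg_emb: torch.Tensor, img_emb: torch.Tensor,
                 temperature: float = 0.07, smoothing: float = 0.1) -> torch.Tensor:
    """Symmetric EEG<->image InfoNCE with label smoothing, so semantically
    close in-batch negatives are penalized less harshly than with hard
    targets. (Illustrative baseline; not the paper's exact loss.)"""
    eeg_emb = F.normalize(eeg_emb, dim=-1)
    img_emb = F.normalize(img_emb, dim=-1)
    logits = eeg_emb @ img_emb.t() / temperature          # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_e2i = F.cross_entropy(logits, targets, label_smoothing=smoothing)
    loss_i2e = F.cross_entropy(logits.t(), targets, label_smoothing=smoothing)
    return 0.5 * (loss_e2i + loss_i2e)
```

Here the diagonal of the similarity matrix holds the matched EEG–image pairs; smoothing spreads a small amount of target mass onto the off-diagonal entries, which is one simple way to model the cross-modal noise the abstract refers to.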