🤖 AI Summary
This work addresses the instability of cross-modal alignment between electroencephalography (EEG) and vision-language representations, which is caused primarily by EEG's low signal-to-noise ratio and structural heterogeneity. The authors propose STAMBRIDGE, a two-stage framework. First, Spectral-Temporal Amplitude-aware Modulation (STAM) extracts well-conditioned EEG features by replacing conventional hard frequency masking with amplitude-derived soft channel weighting and multi-scale temporal convolutions. Second, a model-agnostic Mid-Feature Semantic Bridge (MFSB) constructs a regularized intermediate space that enables staged semantic distillation and more stable alignment. The method achieves 34.50% Top-1 and 65.95% Top-5 accuracy in 200-way zero-shot retrieval on the THINGS-EEG benchmark, and its learned embeddings yield semantically coherent image reconstructions when used with a diffusion model.
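The idea of amplitude-driven soft channel weighting can be illustrated with a minimal sketch. This is not the authors' STAM implementation: the function names (`band_amplitude`, `soft_channel_weights`, `modulate`), the sigmoid over standardized amplitudes, and the `temperature` parameter are all assumptions made for illustration. The sketch only shows the general contrast with a hard mask: every channel keeps a nonzero weight, and the signal is rescaled rather than spectrally truncated.

```python
import cmath
import math


def band_amplitude(signal, low_bin, high_bin):
    """Mean DFT magnitude of one channel over bins low_bin..high_bin (inclusive)."""
    n = len(signal)
    total = 0.0
    for k in range(low_bin, high_bin + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        total += abs(coeff) / n
    return total / (high_bin - low_bin + 1)


def soft_channel_weights(amplitudes, temperature=1.0):
    """Map per-channel amplitudes to weights in (0, 1) with a sigmoid.

    Assumption (not from the paper): amplitudes are standardized across
    channels and then squashed, so no channel is set exactly to zero.
    """
    mean = sum(amplitudes) / len(amplitudes)
    var = sum((a - mean) ** 2 for a in amplitudes) / len(amplitudes)
    std = math.sqrt(var) or 1.0  # fall back to 1.0 if all channels are identical
    return [1.0 / (1.0 + math.exp(-(a - mean) / (std * temperature)))
            for a in amplitudes]


def modulate(eeg, low_bin, high_bin, temperature=1.0):
    """Scale each channel of eeg (a list of channels) by its soft weight."""
    amps = [band_amplitude(ch, low_bin, high_bin) for ch in eeg]
    weights = soft_channel_weights(amps, temperature)
    scaled = [[w * x for x in ch] for w, ch in zip(weights, eeg)]
    return scaled, weights


# Toy input: 3 channels, 64 samples. Channel 0 has a strong tone inside the
# band of interest, channel 1 a weak one, channel 2 a tone outside the band.
n = 64
eeg = [
    [1.0 * math.sin(2 * math.pi * 5 * t / n) for t in range(n)],
    [0.3 * math.sin(2 * math.pi * 5 * t / n) for t in range(n)],
    [0.3 * math.sin(2 * math.pi * 20 * t / n) for t in range(n)],
]
scaled, weights = modulate(eeg, low_bin=4, high_bin=6)
```

In this toy setup the weights decrease from channel 0 to channel 2 but never reach zero, so the out-of-band channel is attenuated rather than discarded. Because each channel is multiplied by a single scalar, its waveform shape is unchanged, which is one plausible way a soft scheme could avoid the ringing that sharp spectral cutoffs can introduce.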
📝 Abstract
Electroencephalography (EEG) visual decoding remains challenging due to the modality gap between low-SNR neural signals and highly structured vision-language spaces, making direct cross-modal alignment unstable. To address this, we propose STAMBRIDGE, a versatile two-stage framework that sequentially tackles feature conditioning and cross-modal alignment. First, we introduce a Spectral-Temporal Amplitude-aware Modulation (STAM) module to extract well-conditioned EEG representations. By replacing hard frequency masking with amplitude-derived soft channel weighting and multi-scale temporal convolutions, STAM explicitly preserves frequency-aware transients while reducing the risk of time-domain ringing artifacts. Building upon these robust neural features, we further introduce a model-agnostic Mid-Feature Semantic Bridge (MFSB) that constructs a regularized intermediate space through directed cross-modal interactions, enabling staged distillation and more stable semantic alignment. Experiments on the THINGS-EEG benchmark show competitive 200-way zero-shot retrieval performance, with 34.50% Top-1 and 65.95% Top-5 accuracy. In addition, embeddings learned by STAMBRIDGE produce semantically coherent image reconstructions with a diffusion model, demonstrating robust EEG-to-vision semantic alignment. The code is available at: https://github.com/thabeatmjh/STAMBRIDGE.
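For readers unfamiliar with the reported metric, the following sketch shows how Top-k retrieval accuracy is commonly computed: each EEG embedding is compared against all candidate image embeddings, and a query counts as a hit if its true image ranks among the k most similar candidates. The use of cosine similarity and the names `cosine` and `top_k_accuracy` are assumptions for illustration, not details taken from the paper, and the toy example uses 4 candidates instead of 200.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def top_k_accuracy(eeg_embeddings, image_embeddings, targets, k):
    """Fraction of EEG queries whose target image is among the k nearest candidates."""
    hits = 0
    for query, target in zip(eeg_embeddings, targets):
        sims = [cosine(query, cand) for cand in image_embeddings]
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets)


# Toy 4-way example with hand-picked 2-D embeddings.
image_embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
eeg_embeddings = [[0.9, 0.1], [0.6, 0.8], [-0.2, -1.0]]
targets = [0, 0, 3]  # index of the correct image for each EEG query

top1 = top_k_accuracy(eeg_embeddings, image_embeddings, targets, k=1)
top2 = top_k_accuracy(eeg_embeddings, image_embeddings, targets, k=2)
```

Here the second query is closer to image 1 than to its true image 0, so it misses at k=1 but is recovered at k=2, giving `top1` of 2/3 and `top2` of 1.0. The same gap between a strict and a relaxed cutoff is what separates the Top-1 and Top-5 figures reported above.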