🤖 AI Summary
To address challenges including modality misalignment between electroencephalography (EEG) signals and clinical text, difficulty in modeling EEG time series, and lack of cross-modal alignment, this paper proposes the first EEG–Language Model (ELM). Leveraging 15,000 paired EEG-report samples, ELM introduces temporal cropping, segmented text encoding, and multi-instance learning (MIL) to mitigate granularity mismatch between modalities. It further constructs a unified EEG–language representation space, enabling zero-shot pathological classification and bidirectional cross-modal retrieval (EEG ↔ report). Experiments demonstrate that ELM significantly outperforms unimodal baselines across four evaluation tasks—marking the first successful realization of EEG-driven zero-shot classification—and achieves up to a 12.3% improvement in cross-modal retrieval AUC. This work establishes a novel paradigm for joint modeling of neural signals and clinical text.
📝 Abstract
Multimodal language modeling has enabled breakthroughs for representation learning, yet remains unexplored in the realm of functional brain data for pathology detection. This paper pioneers EEG-language models (ELMs) trained on clinical reports and 15000 EEGs. We propose to combine multimodal alignment in this novel domain with timeseries cropping and text segmentation, enabling an extension based on multiple instance learning to alleviate misalignment between irrelevant EEG or text segments. Our multimodal models significantly improve pathology detection compared to EEG-only models across four evaluations and for the first time enable zero-shot classification as well as retrieval of both neural signals and reports. In sum, these results highlight the potential of ELMs, representing significant progress for clinical applications.