Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing approaches, which align electroencephalography (EEG) signals only to abstract textual descriptions and thereby neglect fine-grained perceptual information. It also confronts the scarcity of visually evoked EEG data. The authors propose a Generative Visual Grounding (GVG) framework that introduces “visual proxy images” as a bridge between EEG and multimodal large language models (MLLMs). A lightweight EEG-to-image generative model synthesizes instance-level images from non-visual EEG, enabling both image-only alignment and trimodal image, text, and EEG alignment. By tuning only 170 million parameters while keeping the 7B-parameter MLLM backbone frozen, the method yields consistent gains in EEG semantic understanding and visual generation. Notably, GVG-X-Omni matches 1.7B-parameter text-aligned baselines with image-only alignment, and GVG-Janus further improves results through trimodal alignment.

📝 Abstract

Leveraging the universal representations of pre-trained LLMs and MLLMs offers a promising path toward brain foundation models. However, visually-evoked EEG datasets remain scarce, leading existing methods to align neural signals mainly with abstract text, a lossy translation that may discard fine-grained perceptual information encoded in brain activity. We propose Generative Visual Grounding (GVG), a framework that visualizes the invisible by using an EEG-to-image generative model as a visual translator. Instead of forcing EEG into text alone, GVG hallucinates instance-specific proxy images for non-visual EEG, providing structured visual contexts that allow MLLMs to exploit their visual priors for clinical-state interpretation. We validate this idea on two MLLM backbones, GVG-X-Omni and GVG-Janus. Image-only alignment is already competitive: the lightweight GVG-X-Omni matches 1.7B-parameter text-aligned baselines while tuning only 170M parameters on a frozen 7B backbone. We further extend GVG-Janus with trimodal Image+Text alignment, where text supplies categorical semantic anchors and visual proxies enrich neural representations with perceptual details. Experiments show consistent gains in EEG understanding and visual generation, suggesting visual proxy grounding as an effective complement to textual alignment.
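To put the parameter budget quoted in the abstract into proportion, the short sketch below works out the arithmetic. It uses only the rounded figures given above (170M tuned, 7B frozen, 1.7B baseline). The function names are hypothetical. It also assumes the 170M tuned parameters sit on top of the frozen backbone rather than inside it, which the abstract does not state.

```python
# Rounded parameter counts quoted in the abstract, not exact model sizes.
TRAINABLE_PARAMS = 170_000_000          # parameters tuned by GVG-X-Omni
FROZEN_BACKBONE_PARAMS = 7_000_000_000  # frozen MLLM backbone
BASELINE_PARAMS = 1_700_000_000         # text-aligned baseline size


def trainable_fraction(trainable: int, frozen: int) -> float:
    """Share of all parameters that receive gradient updates.

    Assumption: the trainable parameters are counted in addition to
    the frozen backbone, not as a subset of it.
    """
    return trainable / (trainable + frozen)


def relative_size(count: int, reference: int) -> float:
    """Ratio of one quoted parameter count to another."""
    return count / reference


if __name__ == "__main__":
    frac = trainable_fraction(TRAINABLE_PARAMS, FROZEN_BACKBONE_PARAMS)
    ratio = relative_size(TRAINABLE_PARAMS, BASELINE_PARAMS)
    print(f"Tuned share of the full model: {frac:.2%}")
    print(f"Tuned count relative to the 1.7B figure: {ratio:.0%}")
```

Under that assumption, the tuned parameters come to roughly 2.4% of the combined model, and to one tenth of the 1.7B figure quoted for the baselines. The second ratio compares quoted counts only, since the abstract does not say how many of the baseline's parameters are trained.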

Problem

Research questions and friction points this paper is trying to address.

EEG understanding

visual grounding

multimodal LLMs

neural signal alignment

perceptual information

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative Visual Grounding

EEG-to-image generation

Multimodal LLMs