🤖 AI Summary
This study investigates whether the internal representations of large vision-language models (LVLMs) align with human visual cognition. Leveraging image-evoked electroencephalography (EEG) signals and employing ridge regression alongside representational similarity analysis, the authors systematically evaluate the neural alignment between the intermediate layers of 32 open-source LVLMs and human brain responses within the 100–300 ms post-stimulus time window. The work establishes, for the first time, a correspondence between LVLM intermediate representations and EEG signals in this critical temporal window, proposing "neural alignment" as a novel benchmark for evaluating LVLMs. Results indicate that multimodal architecture design contributes more significantly to brain alignment than model parameter count, and that high-performing LVLMs exhibit stronger human-like visual representational capabilities.
📄 Abstract
Large Vision-Language Models (LVLMs) exhibit strong visual understanding and reasoning abilities. However, whether their internal representations reflect human visual cognition remains under-explored. In this paper, we address this question by quantifying LVLM-brain alignment using image-evoked electroencephalogram (EEG) signals, analyzing the effects of model architecture, scale, and image type. Specifically, using ridge regression and representational similarity analysis, we compare visual representations from 32 open-source LVLMs with the corresponding EEG responses. We observe a structured LVLM-brain correspondence. First, intermediate layers (8–16) show peak alignment with EEG activity in the 100–300 ms window, consistent with hierarchical human visual processing. Second, multimodal architectural design contributes 3.4× more to brain alignment than parameter scaling, and models with stronger downstream visual performance exhibit higher EEG similarity. Third, spatiotemporal alignment patterns further match known cortical visual pathways. These results demonstrate that LVLMs learn human-aligned visual representations and establish neural alignment as a biologically grounded benchmark for evaluating and improving LVLMs. These findings may also inform the development of neuro-inspired applications.
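The two alignment metrics named in the abstract, ridge-regression encoding and representational similarity analysis (RSA), can be illustrated with a minimal sketch. This is not the paper's pipeline; the array shapes, synthetic data, and hyperparameters (e.g. `alpha=1.0`, 5-fold cross-validation) are illustrative assumptions. `model_feats` stands in for a layer's activations per image and `eeg` for trial-averaged EEG amplitudes in the 100–300 ms window.

```python
# Hypothetical sketch of encoding-model and RSA alignment scores.
# All data below are synthetic; shapes and settings are assumptions,
# not the paper's actual configuration.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_images, d, n_channels = 200, 64, 17
model_feats = rng.standard_normal((n_images, d))        # layer activations per image
# Synthetic EEG that partially depends on the model features
eeg = model_feats @ rng.standard_normal((d, n_channels)) \
      + rng.standard_normal((n_images, n_channels))

# 1) Encoding analysis: cross-validated ridge regression from model
#    features to EEG, scored as mean Pearson r across channels.
pred = cross_val_predict(Ridge(alpha=1.0), model_feats, eeg, cv=5)
r_per_channel = [np.corrcoef(eeg[:, c], pred[:, c])[0, 1]
                 for c in range(n_channels)]
encoding_score = float(np.mean(r_per_channel))

# 2) RSA: Spearman correlation between the condensed representational
#    dissimilarity matrices (pairwise distances across images).
rdm_model = pdist(model_feats, metric="correlation")
rdm_eeg = pdist(eeg, metric="correlation")
rsa_score, _ = spearmanr(rdm_model, rdm_eeg)

print(f"encoding r = {encoding_score:.2f}, RSA rho = {rsa_score:.2f}")
```

Both scores are bounded in [-1, 1] and would, under the paper's framing, be computed per layer and per time window to locate where alignment peaks.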