🤖 AI Summary
This study addresses a gap in brain-alignment research, which has largely focused on language comprehension or passive visual tasks, by systematically investigating the alignment between foundation models and human brain activity during naturalistic interaction. For the first time, the authors apply a vision-language model (VLM) and a large action model (LAM) to fMRI data recorded while participants played Atari-style games. Using voxel-wise encoding and variance partitioning, they assess how well internal model representations align with neural responses under action-focused and reasoning-focused prompts. Both the VLM and the LAM significantly outperform reinforcement-learning baselines in voxel-wise encoding performance. Prompt effects grow along the cortical hierarchy, peaking in frontoparietal and motor-planning regions. Moreover, LAM representations preferentially align with action-related areas, whereas the VLM is more prompt-symmetric, with comparable unique contributions from action-focused and reasoning-focused prompts.
📝 Abstract
Understanding how humans and artificial intelligence systems predict and plan by interacting with their environment is a fundamental challenge at the intersection of neuroscience and machine learning. Most brain-encoding studies focus on aligning artificial models with brain activity during language comprehension or passive visual processing, while interactive brain-alignment studies have to date been largely limited to reinforcement-learning (RL) agents and theory-based models. To address this gap, we study brain alignment of representative models from two foundation-model families, namely vision-language models (VLMs) and large action models (LAMs), using fMRI recordings from participants playing naturalistic Atari-style video games. Specifically, we examine how action-focused and reasoning-focused prompts shape the models' internal representations and how well those representations align with fMRI brain activity. First, we find that both VLMs and LAMs exhibit significantly higher voxel-wise encoding performance than RL baselines, with the advantage holding even under matched feature dimensionality. Second, prompt-driven gains scale with the cortical processing hierarchy: the largest improvements appear in frontal-parietal and motor-planning regions, while early visual cortex gains roughly half as much. Third, variance partitioning reveals a qualitatively different representational organization: the VLM is prompt-symmetric (12.5% unique action vs. 13.6% unique reasoning), whereas the LAM is prompt-asymmetric (27% unique action vs. -5% unique reasoning), with the asymmetry strongest in frontal-motor cortex. Together, these results demonstrate that action-specialized fine-tuning reorganizes multimodal representations toward action-relevant neural computations even when whole-brain prediction accuracy is statistically equivalent between the VLM and the LAM.
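To make the variance-partitioning result concrete, the following is a minimal sketch of how unique and shared explained variance can be computed from two feature spaces with voxel-wise ridge encoding. It is not the authors' pipeline: the function names, the fixed `alpha`, the simple time-ordered train/test split, and the synthetic data are all assumptions made for illustration. Real fMRI analyses typically also model the hemodynamic delay and choose the regularization strength by cross-validation.

```python
import numpy as np

def ridge_fit_predict(X_train, Y_train, X_test, alpha=1.0):
    """Closed-form ridge regression, fit jointly for all voxels (columns of Y)."""
    mu_x, mu_y = X_train.mean(axis=0), Y_train.mean(axis=0)
    Xc, Yc = X_train - mu_x, Y_train - mu_y
    # W = (X^T X + alpha * I)^-1 X^T Y
    W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(Xc.shape[1]), Xc.T @ Yc)
    return (X_test - mu_x) @ W + mu_y

def r2_per_voxel(Y_true, Y_pred):
    """Held-out coefficient of determination, one value per voxel."""
    ss_res = ((Y_true - Y_pred) ** 2).sum(axis=0)
    ss_tot = ((Y_true - Y_true.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot

def variance_partition(X_action, X_reason, Y, n_train, alpha=1.0):
    """Split explained variance into unique-action, unique-reasoning, and shared parts."""
    def score(X):
        pred = ridge_fit_predict(X[:n_train], Y[:n_train], X[n_train:], alpha)
        return r2_per_voxel(Y[n_train:], pred)

    r2_action = score(X_action)
    r2_reason = score(X_reason)
    r2_joint = score(np.hstack([X_action, X_reason]))
    return {
        "unique_action": r2_joint - r2_reason,   # what only the action features add
        "unique_reasoning": r2_joint - r2_action,  # what only the reasoning features add
        "shared": r2_action + r2_reason - r2_joint,
        "joint": r2_joint,
    }

# Synthetic demo (assumed data): responses depend only on the "action" features.
rng = np.random.default_rng(0)
n_samples, n_features, n_voxels = 400, 8, 5
X_action = rng.standard_normal((n_samples, n_features))
X_reason = rng.standard_normal((n_samples, n_features))
Y = (X_action @ rng.standard_normal((n_features, n_voxels))
     + 0.1 * rng.standard_normal((n_samples, n_voxels)))

parts = variance_partition(X_action, X_reason, Y, n_train=300)
```

Because each component is a difference of held-out scores, a unique contribution can come out negative when adding a feature space makes the joint model generalize worse than the single-space model, which is one way a value such as the LAM's -5% unique reasoning can arise.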