Disentangling the Factors of Convergence between Brains and Computer Vision Models

📅 2025-08-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates the mechanisms driving convergence between AI visual representations and human brain responses. It systematically disentangles the independent and interactive effects of model architecture, training data, and training scale on brain-model representational similarity. Using self-supervised Vision Transformers (DINOv3) and multimodal neuroimaging data (fMRI and MEG), the authors quantify representational alignment with three complementary similarity metrics: Representational Dissimilarity Matrices (RDM), Centered Kernel Alignment (CKA), and Procrustes analysis. Key findings: (1) the largest models trained on human-centric images achieve the highest brain similarity; and (2) alignment emerges hierarchically during training, following a cortical gradient from primary visual cortex to prefrontal regions. The work establishes a synergy between data composition and model scale, and demonstrates a testable temporal correspondence between the evolution of AI representations during training and biological cortical development trajectories.
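Two of the three similarity metrics named above, linear CKA and RDM correlation, are standard and compact enough to sketch. A minimal NumPy illustration, assuming two representation matrices with matched stimuli on the rows (the paper's exact implementations may differ, e.g. in dissimilarity measure or rank-correlation choice):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two representation
    matrices of shape (n_stimuli, n_features). Returns a value in [0, 1]."""
    # Center each feature dimension
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

def rdm(X):
    """Representational Dissimilarity Matrix: 1 - Pearson correlation
    between every pair of stimulus representations."""
    return 1.0 - np.corrcoef(X)

def rdm_similarity(X, Y):
    """Spearman correlation between the upper triangles of the two RDMs."""
    iu = np.triu_indices(X.shape[0], k=1)
    a, b = rdm(X)[iu], rdm(Y)[iu]
    # Spearman = Pearson correlation of ranks (double argsort gives ranks)
    ra, rb = np.argsort(np.argsort(a)), np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]
```

Note the different invariances: linear CKA is unchanged by any orthogonal transform of the features, while RDM correlation only compares the geometry of pairwise stimulus distances, so neither requires the model and the brain to share a feature space.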

📝 Abstract
Many AI models trained on natural images develop representations that resemble those of the human brain. However, the factors that drive this brain-model similarity remain poorly understood. To disentangle how the model, training, and data independently lead a neural network to develop brain-like representations, we trained a family of self-supervised vision transformers (DINOv3) that systematically varied these factors. We compare their representations of images to those of the human brain recorded with both fMRI and MEG, providing high spatial and temporal resolution. We assess brain-model similarity with three complementary metrics focusing on overall representational similarity, topographical organization, and temporal dynamics. We show that all three factors (model size, training amount, and image type) independently and interactively impact each of these brain-similarity metrics. In particular, the largest DINOv3 models trained with the most human-centric images reach the highest brain similarity. The emergence of brain-like representations in AI models follows a specific chronology during training: models first align with the early representations of the sensory cortices, and only align with the late and prefrontal representations of the brain with considerably more training. Finally, this developmental trajectory is indexed by both structural and functional properties of the human cortex: the representations acquired last by the models specifically align with the cortical areas showing the largest developmental expansion, greatest thickness, least myelination, and slowest timescales. Overall, these findings disentangle the interplay between architecture and experience in shaping how artificial neural networks come to see the world as humans do, offering a promising framework to understand how the human brain comes to represent its visual world.
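The third metric family mentioned in the abstract, Procrustes analysis, aligns one representation to the other with an optimal rotation before comparing them. A hedged NumPy sketch of an orthogonal-Procrustes similarity score (this assumes both representations share a feature dimension, e.g. after a common projection, and is not necessarily the paper's exact procedure):

```python
import numpy as np

def procrustes_similarity(X, Y):
    """Orthogonal-Procrustes similarity between two (n_stimuli, n_features)
    matrices. Centers and scales both to unit Frobenius norm, then finds the
    orthogonal map R maximizing trace(R^T X^T Y). That maximum equals the sum
    of singular values of X^T Y, giving a score in [0, 1] where 1 means the
    representations are identical up to rotation/reflection."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    X = X / np.linalg.norm(X)
    Y = Y / np.linalg.norm(Y)
    # Nuclear norm of X^T Y = best achievable alignment under orthogonal maps
    s = np.linalg.svd(X.T @ Y, compute_uv=False)
    return s.sum()
```

Unlike CKA or RDM correlation, this score is computed after explicitly solving for the best rotation, so it is sensitive to how well the two geometries can be superimposed rather than only to their pairwise-distance structure.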
Problem

Research questions and friction points this paper is trying to address.

Identifying factors driving brain-model similarity in vision
Disentangling model, training, and data effects on representations
Assessing how AI models develop brain-like visual processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised vision transformers with systematic factor variations
Multi-modal brain comparison using fMRI and MEG recordings
Three complementary metrics capturing representational similarity, topography, and temporal dynamics