🤖 AI Summary
This study investigates whether diverse vision models share common, universal dimensions in representing object similarity, and what underlies the emergence of those dimensions. Applying non-negative matrix factorization, the authors decompose the object similarity structures of 162 distinct vision models into interpretable dimensions and distinguish universal from model-specific dimensions based on cross-model reproducibility. They systematically identify, for the first time, a set of universal dimensions that align closely with biological vision, specifically with neural responses in macaque inferotemporal (IT) cortex and with human similarity judgments. These dimensions are highly interpretable semantically and are driven primarily by conceptual attributes rather than by engineering factors such as model architecture or training objective. Moreover, models that capture more of these universal dimensions better predict both neural activity and human behavioral data.
📝 Abstract
Deep neural networks trained with different architectures, objectives, and datasets have been reported to converge on similar visual representations. However, it remains unknown which visual properties models actually converge on and which factors may underlie this convergence. To address this, we decompose the object similarity structure of 162 diverse vision models into a small set of non-negative dimensions. To distinguish universal from model-specific dimensions, we then estimate how often each dimension reappears across models. In contrast to model-specific dimensions, universal dimensions are more interpretable and more strongly driven by conceptual image properties, indicating the relevance of interpretability and semantic content as implicit factors driving universality across models. Differences in architecture, objective function, training data, model size, and model performance do not explain the emergence of universal dimensions. However, models with more universal dimensions also better predict macaque IT activity and human similarity judgments, suggesting that universality reflects representations relevant to biological vision. These findings have important implications for understanding the emergent representations underlying deep neural network models and their alignment with biological vision.
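The step of estimating how often a dimension reappears across models can be illustrated with a minimal sketch. It assumes each model has already been decomposed into non-negative dimensions, each stored as a vector of loadings over the same set of objects. The scoring rule used here (for each dimension, take its best Pearson correlation with any dimension of every other model, then average) is an illustrative assumption and not necessarily the authors' procedure. The function names, model names, and toy loadings are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences of loadings."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant vector: correlation is undefined, treat as no match
    return cov / math.sqrt(var_x * var_y)


def universality_scores(models):
    """Score every dimension by how well it reappears in the other models.

    models: dict mapping a model name to a non-empty list of dimensions,
            each dimension being a list of loadings over the same objects.
    Returns a dict mapping (model name, dimension index) to the mean of the
    best-match correlations found in each of the other models.
    """
    if len(models) < 2:
        raise ValueError("need at least two models to compare")
    scores = {}
    for name, dims in models.items():
        other_models = [d for other, d in models.items() if other != name]
        for i, dim in enumerate(dims):
            # Best-matching dimension in each other model (assumed matching rule).
            best = [max(pearson(dim, cand) for cand in other_dims)
                    for other_dims in other_models]
            scores[(name, i)] = sum(best) / len(best)
    return scores


# Toy data (hypothetical): 6 objects, 3 models, 2 dimensions per model.
# The first dimension of each model is similar across models; the second is not.
models = {
    "model_a": [[0.9, 0.8, 0.1, 0.0, 0.1, 0.0], [0.0, 0.7, 0.0, 0.6, 0.1, 0.9]],
    "model_b": [[1.0, 0.7, 0.0, 0.1, 0.0, 0.1], [0.5, 0.0, 0.4, 0.0, 0.6, 0.1]],
    "model_c": [[0.8, 0.9, 0.2, 0.0, 0.0, 0.1], [0.1, 0.0, 0.9, 0.2, 0.8, 0.0]],
}
scores = universality_scores(models)
for key in sorted(scores):
    print(key, round(scores[key], 3))
```

In this toy setup, a dimension shared by all three models receives a score near 1, while a dimension found in only one model scores near or below 0. A cutoff on such a score would be one way to separate universal from model-specific dimensions, though the choice of cutoff is itself an assumption.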