🤖 AI Summary
Comparisons between neural network representations are confounded by symmetries such as unit permutation and rotation of activation space, which obscure underlying equivalence and impede instance-level comparison across models. This work proposes a barycentric alignment framework that quotients out these nuisance symmetries to construct a universal embedding space, enabling, for the first time, instance-level representational comparison across models, modalities, and even individuals (e.g., brain regions). Because it aligns geometric structure before constructing the shared embedding, the method applies equally to vision, language, and neuroimaging data. It identifies the input features that drive representational convergence or divergence, approaches the cross-modal alignment performance of contrastively trained models using only unimodal representations, and uncovers fine-grained patterns of representational consistency across the human visual hierarchy.
📝 Abstract
Comparing representations across neural networks is challenging because representations admit symmetries, such as arbitrary reordering of units or rotations of activation space, that obscure underlying equivalence between models. We introduce a barycentric alignment framework that quotients out these nuisance symmetries to construct a universal embedding space across many models. Unlike existing similarity measures, which summarize relationships over entire stimulus sets, this framework enables similarity to be defined at the level of individual stimuli, revealing inputs that elicit convergent versus divergent representations across models. Using this instance-level notion of similarity, we identify systematic input properties that predict representational convergence versus divergence across vision and language model families. We also construct universal embedding spaces for brain representations across individuals and cortical regions, enabling instance-level comparison of representational agreement across stages of the human visual hierarchy. Finally, we apply the same barycentric alignment framework to purely unimodal vision and language models and find that post-hoc alignment into a shared space yields image-text similarity scores that closely track human cross-modal judgments and approach the performance of contrastively trained vision-language models. Strikingly, this suggests that independently learned representations already share sufficient geometric structure for human-aligned cross-modal comparison. Together, these results show that resolving representational similarity at the level of individual stimuli reveals phenomena that cannot be detected by set-level comparison metrics.
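The core idea, quotienting out permutation and rotation symmetries by aligning every model's representation to a shared barycenter, can be illustrated with a toy generalized-Procrustes sketch. This is an assumption-laden illustration, not the paper's actual method: function names are invented, all representations are assumed to have the same dimensionality, and the alignment group is taken to be the orthogonal group (which contains both rotations and unit permutations).

```python
import numpy as np

def procrustes_align(X, Y):
    """Orthogonal map R minimizing ||X @ R - Y||_F (Procrustes problem).

    Rotations and unit permutations are both orthogonal matrices, so
    optimizing over R quotients out exactly these nuisance symmetries.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return X @ (U @ Vt)

def barycentric_embed(reps, n_iter=100):
    """Iteratively align each representation (stimuli x units) to the
    running mean of all aligned representations (the barycenter)."""
    aligned = list(reps)
    for _ in range(n_iter):
        barycenter = np.mean(aligned, axis=0)
        aligned = [procrustes_align(R, barycenter) for R in reps]
    return np.mean(aligned, axis=0), aligned

def instance_disagreement(aligned):
    """Per-stimulus spread across models in the shared space: low values
    mark convergently represented inputs, high values divergent ones."""
    stack = np.stack(aligned)  # (models, stimuli, dims)
    return np.linalg.norm(stack - stack.mean(axis=0), axis=-1).mean(axis=0)
```

In this sketch, the rows of the barycenter form the universal embedding, and `instance_disagreement` is one plausible reading of the abstract's instance-level similarity: a score per stimulus rather than one summary number per model pair. Representations of different widths would first need projection to a common dimensionality, which is omitted here.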