🤖 AI Summary
This work investigates whether deep neural networks can learn and generalize the abstract visual relation "same versus different." Addressing the poor out-of-distribution (OOD) generalization of existing models on this fundamental relational task, we systematically evaluate a range of architectures (pretrained vision Transformers such as ViT versus CNNs), pretraining paradigms, and fine-tuning strategies, using ablation studies and cross-distribution generalization benchmarks on abstract geometric shape datasets. Key finding: after fine-tuning on simple abstract shapes that lack texture and color, pretrained vision Transformers achieve near-perfect OOD accuracy, substantially outperforming CNNs. We provide empirical evidence that pretrained Transformers can transfer abstract relational knowledge; moreover, the quality of abstract relational representations proves more important for generalization than low-level visual cues. These results reveal both the latent potential of deep networks to learn abstract visual relations and a viable path toward realizing it.
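To make the setup concrete, here is a minimal sketch of the kind of fine-tuning pipeline described above: a pretrained ViT from `timm` retrained as a binary same/different classifier. The model name, hyperparameters, and dataset interface are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch (not the paper's exact setup): fine-tune a pretrained
# vision Transformer as a binary same/different classifier using timm.
# The dataset is assumed to yield (image_tensor, label) pairs with
# label 1 = "same" and label 0 = "different".
import timm
import torch
from torch import nn
from torch.utils.data import DataLoader

def fine_tune(dataset, epochs=5, batch_size=64, lr=2e-5, device="cuda"):
    """Plain fine-tuning loop; hyperparameters are illustrative, not the paper's."""
    # Pretrained ViT with a freshly initialized 2-way classification head.
    model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=2)
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            loss = criterion(model(images), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```

OOD generalization would then be measured by evaluating the returned model on same/different pairs drawn from a shape distribution different from the fine-tuning set.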
📝 Abstract
Although deep neural networks can achieve human-level performance on many object recognition benchmarks, prior work suggests that these same models fail to learn simple abstract relations, such as determining whether two objects are the same or different. Much of this prior work focuses on training convolutional neural networks to classify images of two same or two different abstract shapes, testing generalization only on within-distribution stimuli. In this article, we comprehensively study whether deep neural networks can acquire and generalize same-different relations both within- and out-of-distribution, using a variety of architectures, forms of pretraining, and fine-tuning datasets. We find that certain pretrained transformers can learn a same-different relation that generalizes with near-perfect accuracy to out-of-distribution stimuli. Furthermore, we find that fine-tuning on abstract shapes that lack texture or color provides the strongest out-of-distribution generalization. Our results suggest that, with the right approach, deep neural networks can learn generalizable same-different visual relations.
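To illustrate what texture- and color-free abstract shape stimuli of this kind might look like, the sketch below renders grayscale images containing two shapes that are either identical or distinct. The shape vocabulary, sizes, and layout are hypothetical, for illustration only; they are not the paper's datasets.

```python
# Illustrative stimulus generator (not the paper's datasets): grayscale
# images with two abstract shapes that are either the same or different.
import random
from PIL import Image, ImageDraw

SHAPES = ["circle", "square", "triangle"]  # toy shape vocabulary (assumption)

def draw_shape(draw, shape, cx, cy, r=30):
    """Render one filled white shape centered at (cx, cy)."""
    if shape == "circle":
        draw.ellipse((cx - r, cy - r, cx + r, cy + r), fill=255)
    elif shape == "square":
        draw.rectangle((cx - r, cy - r, cx + r, cy + r), fill=255)
    elif shape == "triangle":
        draw.polygon([(cx, cy - r), (cx - r, cy + r), (cx + r, cy + r)], fill=255)

def make_pair(same: bool, size=224):
    """Return a (PIL.Image, label) sample: two shapes, same or different."""
    img = Image.new("L", (size, size), color=0)  # flat black background: no texture, no color
    draw = ImageDraw.Draw(img)
    first = random.choice(SHAPES)
    second = first if same else random.choice([s for s in SHAPES if s != first])
    draw_shape(draw, first, size // 4, size // 2)
    draw_shape(draw, second, 3 * size // 4, size // 2)
    return img, int(same)

# Usage: draw one balanced sample for fine-tuning or evaluation.
image, label = make_pair(same=random.random() < 0.5)
```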