🤖 AI Summary
Convolutional neural networks (CNNs) exhibit significantly weaker generalization than humans on visual relational reasoning tasks—particularly same-different discrimination—under standard supervised learning.
Method: We propose a meta-learning framework that trains standard CNNs via Model-Agnostic Meta-Learning (MAML) on a distribution of diverse same-different tasks, systematically varying object instances, shapes, and viewpoints through task sampling. This enables the model to acquire abstract, task-invariant same-different relational representations.
Contribution/Results: Our approach is the first to demonstrate that off-the-shelf CNNs can learn structured relational abstractions through meta-training alone. Experiments show substantial improvements in zero-shot generalization to unseen categories, poses, and deformations, with accuracy approaching human-level performance. These results provide critical empirical evidence that CNNs—when appropriately meta-trained—can support compositional relational reasoning, overcoming longstanding generalization bottlenecks in visual relationship recognition.
📝 Abstract
While convolutional neural networks (CNNs) have come to match and exceed human performance in many settings, the tasks these models optimize for are largely constrained to the level of individual objects, such as classification and captioning. Humans remain vastly superior to CNNs in visual tasks involving relations, including the ability to identify two objects as 'same' or 'different'. A number of studies have shown that while CNNs can be coaxed into learning the same-different relation in some settings, they tend to generalize poorly to other instances of this relation. In this work we show that the same CNN architectures that fail to generalize the same-different relation with conventional training are able to succeed when trained via meta-learning, which explicitly encourages abstraction and generalization across tasks.
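The inner-loop/outer-loop structure of MAML that the method relies on can be sketched in a few lines. This is a first-order MAML toy, not the paper's setup: the paper meta-trains CNNs on distributions of same-different image tasks, whereas here each "task" is a hypothetical 1-D regression problem y = a·x with a randomly drawn slope, and the "model" is a single scalar weight. The point is only to show the two-level optimization: adapt on a task's support set, then update the meta-initialization using the query-set gradient.

```python
import numpy as np

# First-order MAML on a toy distribution of 1-D regression tasks.
# Illustrative sketch only; task design and hyperparameters are assumptions,
# not the paper's same-different image tasks or CNN architecture.

rng = np.random.default_rng(0)

def sample_task():
    """One task: y = a * x with a random slope a; returns support and query sets."""
    a = rng.uniform(-2.0, 2.0)
    xs, xq = rng.normal(size=10), rng.normal(size=10)
    return (xs, a * xs), (xq, a * xq)

def grad(w, x, y):
    """Gradient of the mean-squared error of w*x against y, w.r.t. w."""
    return np.mean(2.0 * (w * x - y) * x)

def meta_train(steps=200, n_tasks=8, alpha=0.05, beta=0.02):
    w = 0.0  # the meta-initialization being learned
    for _ in range(steps):
        meta_g = 0.0
        for _ in range(n_tasks):
            (xs, ys), (xq, yq) = sample_task()
            w_adapted = w - alpha * grad(w, xs, ys)  # inner-loop adaptation
            meta_g += grad(w_adapted, xq, yq)        # first-order outer gradient
        w -= beta * meta_g / n_tasks                 # outer-loop (meta) update
    return w

w0 = meta_train()

# After meta-training, a single inner-loop step on a new task's support set
# should reduce that task's query loss: the initialization adapts quickly.
(xs, ys), (xq, yq) = sample_task()
w1 = w0 - 0.05 * grad(w0, xs, ys)
base = np.mean((w0 * xq - yq) ** 2)
adapted = np.mean((w1 * xq - yq) ** 2)
```

The key design point this mirrors is that the outer update optimizes post-adaptation performance across a distribution of tasks, which is what pushes the learned initialization toward task-invariant structure rather than a solution for any single task.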