🤖 AI Summary
Out-of-distribution (OOD) generalization benchmarks alone cannot verify whether models have truly learned environment-invariant compositional structures: compositionality remains unvalidated even when OOD accuracy appears high.
Method: We introduce the first explicit verification criterion for feature-level compositionality and propose two novel architectures explicitly endowed with strong compositional inductive biases.
Contribution/Results: Under an ARC-style structured OOD evaluation protocol, we find that standard models, including MLPs, CNNs, and Transformers, fail on well-defined compositional OOD tasks. Although our new architectures achieve near-perfect OOD accuracy, fine-grained feature analysis reveals they still do not learn correct compositional representations. Crucially, our results demonstrate that compositionality must be actively and independently verified, not inferred from OOD performance alone, establishing a new, interpretable evaluation paradigm for OOD generalization grounded in representational analysis.
📝 Abstract
Out-of-distribution (OOD) generalisation is considered a hallmark of human and animal intelligence. To achieve OOD generalisation through composition, a system must discover the environment-invariant properties of experienced input-output mappings and transfer them to novel inputs. This can be realised if an intelligent system can identify appropriate, task-invariant, and composable input features, as well as the methods for composing them, allowing it to act not on the interpolation between learnt data points but on the task-invariant composition of those features. We propose that in order to confirm that an algorithm does indeed learn compositional structures from data, it is not enough to test on an OOD setup alone; one must also confirm that the features identified are indeed compositional. We showcase this by exploring two tasks with clearly defined OOD metrics that are not OOD solvable by three commonly used neural networks: a Multi-Layer Perceptron (MLP), a Convolutional Neural Network (CNN), and a Transformer. In addition, we develop two novel network architectures imbued with biases that allow them to be successful in OOD scenarios. We show that even with correct biases and almost perfect OOD performance, an algorithm can still fail to learn the correct features for compositional generalisation.
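The core distinction above can be made concrete with a toy experiment (an illustrative sketch, not the paper's actual tasks or architectures): a label is built compositionally as `y = val_a[a] + val_b[b]`, training is restricted to a subset of `(a, b)` combinations, and evaluation uses the held-out combinations. A model with a factored (per-feature) representation can generalise to unseen combinations, while a model that memorises joint `(a, b)` identities cannot, even though both fit the training data perfectly. Note also the gauge freedom in the factored solution (adding a constant to all `a`-weights and subtracting it from all `b`-weights changes nothing), which hints at why perfect OOD accuracy still does not pin down the "correct" features.

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = 5, 5
val_a = rng.normal(size=A)  # hidden per-feature values (assumed toy setup)
val_b = rng.normal(size=B)

pairs = [(a, b) for a in range(A) for b in range(B)]
# Hold out all combinations with a >= 3 AND b >= 3: every individual feature
# value is seen in training, but these joint combinations are OOD.
train = [(a, b) for a, b in pairs if not (a >= 3 and b >= 3)]
test = [(a, b) for a, b in pairs if a >= 3 and b >= 3]

def feat_factored(a, b):
    """Compositional encoding: one-hot of a concatenated with one-hot of b."""
    x = np.zeros(A + B)
    x[a] = 1.0
    x[A + b] = 1.0
    return x

def feat_joint(a, b):
    """Memorising encoding: one-hot over all (a, b) pairs jointly."""
    x = np.zeros(A * B)
    x[a * B + b] = 1.0
    return x

def fit_and_ood_error(feat):
    """Fit a linear model on the train split, return max abs error on OOD pairs."""
    X = np.stack([feat(a, b) for a, b in train])
    y = np.array([val_a[a] + val_b[b] for a, b in train])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimum-norm least squares
    Xt = np.stack([feat(a, b) for a, b in test])
    yt = np.array([val_a[a] + val_b[b] for a, b in test])
    return float(np.max(np.abs(Xt @ w - yt)))

err_factored = fit_and_ood_error(feat_factored)  # near zero: composes features
err_joint = fit_and_ood_error(feat_joint)        # large: unseen pairs get weight 0
```

Both encodings achieve zero training error here; only the OOD split, together with inspection of the learned weights, separates composition from memorisation, which is the verification step the abstract argues for.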