Abstract
The ability to compose learned concepts and apply them in novel settings is key to human intelligence, but remains a persistent limitation in state-of-the-art machine learning models. To address this issue, we introduce COGITAO, a modular and extensible data generation framework and benchmark designed to systematically study compositionality and generalization in visual domains. Drawing inspiration from the ARC-AGI problem setting, COGITAO constructs rule-based tasks which apply a set of transformations to objects in grid-like environments. It supports composition, at adjustable depth, over a set of 28 interoperable transformations, along with extensive control over grid parametrization and object properties. This flexibility enables the creation of millions of unique task rules -- surpassing concurrent datasets by several orders of magnitude -- across a wide range of difficulties, while allowing virtually unlimited sample generation per rule. We provide baseline experiments using state-of-the-art vision models, highlighting their consistent failure to generalize to novel combinations of familiar elements, despite strong in-domain performance. COGITAO is fully open-sourced, including all code and datasets, to support continued research in this field.
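To make the composition mechanism concrete, the sketch below shows one way a rule could be sampled as a composition of primitive grid transformations at an adjustable depth. This is a minimal illustration under assumed semantics, not COGITAO's actual API: the primitive names (`flip_h`, `flip_v`, `rotate_90`) and the `sample_rule` helper are hypothetical, and the real framework draws from 28 interoperable transformations rather than the three shown here.

```python
import random

# Hypothetical primitives operating on a grid represented as a list of rows.
# The names and set of primitives are illustrative, not the framework's own.

def flip_h(grid):
    """Mirror the grid left-right."""
    return [row[::-1] for row in grid]

def flip_v(grid):
    """Mirror the grid top-bottom."""
    return grid[::-1]

def rotate_90(grid):
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

PRIMITIVES = [flip_h, flip_v, rotate_90]

def sample_rule(depth, rng=random):
    """Sample a task rule as a composition of `depth` primitives.

    A deeper composition yields a harder rule; each sampled rule can
    then be applied to arbitrarily many input grids to generate samples.
    """
    steps = [rng.choice(PRIMITIVES) for _ in range(depth)]

    def rule(grid):
        for step in steps:
            grid = step(grid)
        return grid

    return rule

if __name__ == "__main__":
    rule = sample_rule(depth=3, rng=random.Random(0))
    print(rule([[1, 2], [3, 4]]))
```

Because depth is a free parameter and primitives compose freely, even a small transformation set yields a combinatorially large rule space, which is the property the framework scales to millions of unique rules.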