Abstract
The ability to compose learned concepts and apply them in novel settings is key to human intelligence, but remains a persistent limitation in state-of-the-art machine learning models. To address this issue, we introduce COGITAO, a modular and extensible data generation framework and benchmark designed to systematically study compositionality and generalization in visual domains. Drawing inspiration from the ARC-AGI problem setting, COGITAO constructs rule-based tasks which apply a set of transformations to objects in grid-like environments. It supports composition, at adjustable depth, over a set of 28 interoperable transformations, along with extensive control over grid parametrization and object properties. This flexibility enables the creation of millions of unique task rules -- surpassing concurrent datasets by several orders of magnitude -- across a wide range of difficulties, while allowing virtually unlimited sample generation per rule. We provide baseline experiments using state-of-the-art vision models, highlighting their consistent failure to generalize to novel combinations of familiar elements, despite strong in-domain performance. COGITAO is fully open-sourced, including all code and datasets, to support continued research in this field.
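To make the composition mechanism concrete, the sketch below shows one way a rule could be sampled as a composition of primitive grid transformations at an adjustable depth. This is a minimal illustration under assumed semantics, not COGITAO's actual API: the primitive names (`flip_h`, `flip_v`, `rotate_90`) and the `sample_rule` helper are hypothetical, and the real framework draws from 28 interoperable transformations rather than the three shown here.

```python
import random

# Hypothetical primitives operating on a grid represented as a list of rows.
# The names and set of primitives are illustrative, not the framework's own.

def flip_h(grid):
    """Mirror the grid left-right."""
    return [row[::-1] for row in grid]

def flip_v(grid):
    """Mirror the grid top-bottom."""
    return grid[::-1]

def rotate_90(grid):
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

PRIMITIVES = [flip_h, flip_v, rotate_90]

def sample_rule(depth, rng=random):
    """Sample a task rule as a composition of `depth` primitives.

    A deeper composition yields a harder rule; each sampled rule can
    then be applied to arbitrarily many input grids to generate samples.
    """
    steps = [rng.choice(PRIMITIVES) for _ in range(depth)]

    def rule(grid):
        for step in steps:
            grid = step(grid)
        return grid

    return rule

if __name__ == "__main__":
    rule = sample_rule(depth=3, rng=random.Random(0))
    print(rule([[1, 2], [3, 4]]))
```

Because depth is a free parameter and primitives compose freely, even a small transformation set yields a combinatorially large rule space, which is the property the framework scales to millions of unique rules.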