The ARC of Progress towards AGI: A Living Survey of Abstraction and Reasoning

📅 2026-03-09
🤖 AI Summary
This survey systematically evaluates how AI systems generalize across successive versions of abstract reasoning benchmarks, revealing a significant performance drop whenever models encounter a new benchmark generation. It conducts the first cross-generational comparison of 82 methods, spanning program synthesis, neuro-symbolic systems, large language models, and test-time optimization, across three versions of ARC-AGI and the ARC Prize competitions, and establishes a living evaluation framework that incorporates synthetic data and resource-constrained training. Results show that top-performing systems achieve 93.0% accuracy on ARC-AGI-1 but decline sharply to 68.8% and 13% on the subsequent versions, substantially underperforming humans. Notably, inference costs have fallen by a factor of 390 within a year, highlighting the efficient learning potential of models developed under Kaggle's compute constraints.

📝 Abstract
The Abstraction and Reasoning Corpus (ARC-AGI) has become a key benchmark for fluid intelligence in AI. This survey presents the first cross-generation analysis of 82 approaches across three benchmark versions and the ARC Prize 2024-2025 competitions. Our central finding is that performance degradation across versions is consistent across all paradigms: program synthesis, neuro-symbolic, and neural approaches all exhibit 2-3x drops from ARC-AGI-1 to ARC-AGI-2, indicating fundamental limitations in compositional generalization. While systems now reach 93.0% on ARC-AGI-1 (Opus 4.6), performance falls to 68.8% on ARC-AGI-2 and 13% on ARC-AGI-3, as humans maintain near-perfect accuracy across all versions. Cost fell 390x in one year (o3's $4,500/task to GPT-5.2's $12/task), although this largely reflects reduced test-time parallelism. Trillion-scale models vary widely in score and cost, while Kaggle-constrained entries (660M-8B) achieve competitive results, aligning with Chollet's thesis that intelligence is skill-acquisition efficiency. Test-time adaptation and refinement loops emerge as critical success factors, while compositional reasoning and interactive learning remain unsolved. ARC Prize 2025 winners needed hundreds of thousands of synthetic examples to reach 24% on ARC-AGI-2, confirming that reasoning remains knowledge-bound. This first release of the ARC-AGI Living Survey captures the field as of February 2026, with updates at https://nimi-ai.com/arc-survey/
Problem

Research questions and friction points this paper is trying to address.

compositional generalization
abstract reasoning
fluid intelligence
ARC-AGI
reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

compositional generalization
test-time adaptation
ARC-AGI benchmark
fluid intelligence
skill-acquisition efficiency