🤖 AI Summary
Large language models (LLMs) exhibit limited performance on abstract reasoning tasks—particularly the ARC-AGI benchmark—due to insufficient generalization and lack of structured search over solution spaces.
Method: This paper proposes a task-customized, unified generation-and-scoring framework: (1) leveraging the same LLM both as a candidate solution generator and as a self-scorer via output token probabilities; (2) introducing multi-stage, task-aware data augmentation; (3) integrating low-overhead depth-first search (DFS) to efficiently explore high-probability solution regions; and (4) applying probability-weighted candidate filtering to enhance decision robustness.
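The self-scoring and filtering steps (points 1 and 4) could be sketched roughly as follows: each candidate answer is assigned the probability the LLM itself puts on its output tokens, and candidates that recur across augmented views of the task accumulate probability mass. All names here (`select_candidates`, the toy `samples`) are illustrative assumptions, not the paper's code:

```python
import math
from collections import defaultdict

def select_candidates(scored_samples, top_k=2):
    """Aggregate per-view probabilities for each candidate grid and
    return the top_k candidates by total probability mass.

    scored_samples: list of (candidate, logprob) pairs; the same
    candidate may appear under several augmented views of the task.
    """
    mass = defaultdict(float)
    for candidate, logprob in scored_samples:
        # Sum probabilities (not log-probs) so a candidate generated
        # under many augmentations accumulates evidence.
        mass[candidate] += math.exp(logprob)
    ranked = sorted(mass.items(), key=lambda kv: kv[1], reverse=True)
    return [candidate for candidate, _ in ranked[:top_k]]

# Toy example: three candidates scored under augmented task views.
samples = [
    ("grid_a", math.log(0.30)),
    ("grid_b", math.log(0.25)),
    ("grid_a", math.log(0.20)),  # grid_a reappears under another view
    ("grid_c", math.log(0.10)),
]
print(select_candidates(samples))  # → ['grid_a', 'grid_b']
```

Returning the top two candidates matches the ARC-AGI protocol, which allows two attempts per task.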
Results: The method achieves 71.6% accuracy (286.5/400) on the public ARC-AGI test set—the state-of-the-art among open-source approaches—with an average per-task inference cost of ~$0.02 on an NVIDIA RTX 4090. It offers high performance, full transparency (no black-box components), and strong reproducibility.
📝 Abstract
The Abstraction and Reasoning Corpus (ARC-AGI) poses a significant challenge for large language models (LLMs), exposing limitations in their abstract reasoning abilities. In this work, we leverage task-specific data augmentations throughout the training, generation, and scoring phases, and employ a depth-first search algorithm to generate diverse, high-probability candidate solutions. Furthermore, we utilize the LLM not only as a generator but also as a scorer, using its output probabilities to select the most promising solutions. Our method achieves a score of 71.6% (286.5/400 solved tasks) on the public ARC-AGI evaluation set, demonstrating state-of-the-art performance among publicly available approaches. While concurrent closed-source work has reported higher scores, our method distinguishes itself through its transparency, reproducibility, and remarkably low inference cost, averaging only around 2ct per task on readily available hardware (we assume a price of 36ct/hour for an NVIDIA RTX 4090 GPU).
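The depth-first search the abstract mentions can be pictured as a pruned walk over the model's token tree: branches are expanded greedily in order of probability, and any branch whose cumulative probability falls below a cutoff is discarded, so only high-probability complete solutions are enumerated. The sketch below is a minimal toy illustration of that idea (the `step_probs` callback and `TREE` stand in for querying a real LLM's next-token distribution; they are assumptions, not the paper's implementation):

```python
def dfs_sample(step_probs, threshold=0.05, prefix=(), prefix_p=1.0):
    """Depth-first enumeration of token sequences whose cumulative
    probability stays above `threshold`.

    step_probs(prefix) -> dict mapping next tokens to probabilities
    (a stand-in for the LLM's next-token distribution).
    Returns a list of (sequence, probability) pairs for complete sequences.
    """
    options = step_probs(prefix)
    if not options:  # no continuations: sequence is complete
        return [(prefix, prefix_p)]
    results = []
    # Expand the most probable branches first, depth-first.
    for token, p in sorted(options.items(), key=lambda kv: -kv[1]):
        joint = prefix_p * p
        if joint < threshold:  # prune low-probability branches early
            continue
        results.extend(dfs_sample(step_probs, threshold, prefix + (token,), joint))
    return results

# Toy next-token distributions for a tiny, fixed tree of grid tokens.
TREE = {
    (): {"1": 0.7, "2": 0.3},
    ("1",): {"1": 0.6, "0": 0.4},
    ("2",): {"2": 0.9, "0": 0.1},
}

def toy_step_probs(prefix):
    return TREE.get(prefix, {})

for seq, p in dfs_sample(toy_step_probs, threshold=0.2):
    print(seq, round(p, 2))
```

With the 0.2 cutoff, the branch ("2", "0") (joint probability 0.03) is pruned, while three complete sequences survive, which is the sense in which the search stays inside high-probability solution regions at low compute cost.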