🤖 AI Summary
Large language models (LLMs) lack rigorous, theory-grounded benchmarks for evaluating abstract reasoning—the capacity to extract patterns and generalize rules beyond surface-level associations. Method: We propose the first theoretically motivated abstract reasoning benchmark, formalizing abstract reasoning as pattern extraction and rule generalization. To enforce essential generalization and suppress superficial matching, we design symbol remapping tasks. We introduce two novel metrics: γ-score (base reasoning accuracy) and δ-score (symbol dependency), enabling the first theoretical distinction between genuine abstraction and spurious memorization. Contribution/Results: Our benchmark evaluates open-source models (7B–70B), commercial APIs, and multi-agent systems on high-abstraction tasks—including non-decimal arithmetic. Experiments reveal fundamental deficiencies in LLMs’ symbol invariance and rule transfer. Crucially, δ-score robustly quantifies memory reliance, demonstrating that chain-of-thought and other prompting techniques fail to bridge the abstract reasoning gap.
📝 Abstract
In this paper, we aim to establish a simple, effective, and theoretically grounded benchmark for rigorously probing abstract reasoning in Large Language Models (LLMs). To achieve this, we first develop a mathematical framework that defines abstract reasoning as the ability to: (i) extract essential patterns independent of surface representations, and (ii) apply consistent rules to these abstract patterns. Based on this framework, we introduce two novel, complementary metrics: the γ-score measures basic reasoning accuracy, while the δ-score quantifies a model's reliance on specific symbols rather than on underlying patterns, a key indicator of true abstraction versus mere memorization. To implement this measurement, we design a benchmark built on systematic symbol remapping in rule-based tasks, which forces models to demonstrate genuine pattern recognition beyond superficial token matching. Extensive evaluations of LLMs on this benchmark (commercial API models, open-source 7B–70B models, and multi-agent systems) reveal: 1) critical limitations in non-decimal arithmetic and symbolic reasoning; 2) persistent abstraction gaps despite chain-of-thought prompting; and 3) the δ-score's effectiveness in robustly measuring memory dependence by quantifying performance degradation under symbol remapping, particularly highlighting operand-specific memorization. These findings underscore that current LLMs, despite domain-specific strengths, still lack robust abstract reasoning, and they highlight key areas for future improvement.
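To make the measurement scheme concrete, the sketch below remaps the symbols of a rule-based arithmetic task and computes a δ-style score as the relative accuracy drop under remapping. This is a minimal illustrative sketch, not the paper's implementation: the glyph alphabet, the helper names (`remap_symbols`, `delta_score`), and the exact δ formula are assumptions introduced here.

```python
import random

def remap_symbols(task: str, mapping: dict) -> str:
    """Apply a bijective symbol remapping to every character of a task string,
    leaving characters outside the mapping (operators, spaces) unchanged."""
    return "".join(mapping.get(ch, ch) for ch in task)

# Example: remap the digits of a base-7 arithmetic task to arbitrary glyphs
# (the glyph alphabet is an illustrative choice, not from the paper).
digits = "0123456"
glyphs = list("QWERTYU")
random.seed(0)
random.shuffle(glyphs)
mapping = dict(zip(digits, glyphs))

task = "3 + 5 = 11"  # base-7 arithmetic: 3 + 5 = 11
remapped = remap_symbols(task, mapping)

def delta_score(gamma_original: float, gamma_remapped: float) -> float:
    """One plausible δ formulation: relative accuracy drop under remapping.
    δ near 0 suggests symbol-invariant reasoning; δ near 1 suggests that
    accuracy depended on the familiar surface symbols (memorization)."""
    if gamma_original == 0:
        return 0.0
    return max(0.0, (gamma_original - gamma_remapped) / gamma_original)
```

Under this formulation, a model with γ = 0.8 on the original tasks that drops to 0.2 after remapping would score δ = 0.75, indicating heavy reliance on the familiar digit symbols rather than the underlying rule.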