🤖 AI Summary
This work investigates whether large language models (LLMs) possess genuine abstract reasoning capability, challenging the prevailing assumption that poor zero-shot performance implies an absence of such ability. Through systematic replication experiments, we propose and validate a lightweight intervention: fine-tuning only a small fraction of the input embedding layer (under 0.1% of total parameters), which yields near-perfect performance on multiple abstract reasoning benchmarks, including variants of Raven's Progressive Matrices. Crucially, these gains do not reliably transfer across datasets, exposing a fundamental confound in current evaluation paradigms: the conflation of *task adaptation* with *intrinsic reasoning capacity*. Our core contributions are threefold: (i) a refined, empirically grounded definition of measurable abstract reasoning; (ii) identification of input representation sensitivity as the primary bottleneck; and (iii) a call for decoupled evaluation frameworks that explicitly separate representation learning from reasoning mechanisms.
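To make the intervention concrete, the sketch below shows one way to restrict fine-tuning to the input embedding layer of a pretrained causal LM using PyTorch and Hugging Face Transformers. It is a minimal illustration under stated assumptions, not the authors' exact setup: the checkpoint name `gpt2` and the learning rate are placeholders, and the sketch unfreezes the full embedding matrix, whereas the paper tunes only a subset of it (keeping trainable parameters below 0.1% of the total).

```python
# Minimal sketch (assumption: not the authors' exact code): freeze a pretrained
# causal LM and allow gradient updates only in its input embedding layer.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder checkpoint

# Freeze every parameter, then re-enable gradients for the input embeddings.
for param in model.parameters():
    param.requires_grad = False
for param in model.get_input_embeddings().parameters():
    param.requires_grad = True

# Report the trainable fraction; parameters shared via weight tying are counted once.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.3f}%)")

# The optimizer then only ever touches the embedding matrix.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
```

Restricting updates further, for example to the embedding rows of task-specific tokens only, could be done by masking the embedding weight's gradient with a hook; that refinement is an assumption about how the sub-0.1% budget might be implemented, not something stated in the abstract.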
📝 Abstract
Recent work has argued that large language models (LLMs) are not "abstract reasoners", citing their poor zero-shot performance on a variety of challenging tasks as evidence. We revisit these experiments in order to add nuance to the claim. First, we show that while LLMs indeed perform poorly in a zero-shot setting, even tuning a small subset of parameters for input encoding can enable near-perfect performance. However, we also show that this finetuning does not necessarily transfer across datasets. We take this collection of empirical results as an invitation to (re-)open the discussion of what it means to be an "abstract reasoner", and why it matters whether LLMs fit the bill.