🤖 AI Summary
This work investigates whether large language models (LLMs) possess genuine abstract reasoning capability, challenging the prevailing assumption that poor zero-shot performance implies an absence of such ability. Through systematic replication experiments, we propose and validate a lightweight intervention: fine-tuning only a small fraction of the input embedding layer (under 0.1% of total parameters), which yields near-perfect performance on multiple abstract reasoning benchmarks, including variants of Raven's Progressive Matrices. Crucially, these gains do not reliably transfer across datasets, exposing a fundamental confound in current evaluation paradigms: the conflation of *task adaptation* with *intrinsic reasoning capacity*. Our core contributions are threefold: (i) a refined, empirically grounded definition of measurable abstract reasoning; (ii) identification of input representation sensitivity as the primary bottleneck; and (iii) a call for decoupled evaluation frameworks that explicitly separate representation learning from reasoning mechanisms.
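To make the intervention concrete, the sketch below shows one way to restrict fine-tuning to the input embedding layer of a pretrained causal LM using PyTorch and Hugging Face Transformers. It is a minimal illustration under stated assumptions, not the authors' exact setup: the checkpoint name `gpt2` and the learning rate are placeholders, and the sketch unfreezes the full embedding matrix, whereas the paper tunes only a subset of it (keeping trainable parameters below 0.1% of the total).

```python
# Minimal sketch (assumption: not the authors' exact code): freeze a pretrained
# causal LM and allow gradient updates only in its input embedding layer.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder checkpoint

# Freeze every parameter, then re-enable gradients for the input embeddings.
for param in model.parameters():
    param.requires_grad = False
for param in model.get_input_embeddings().parameters():
    param.requires_grad = True

# Report the trainable fraction; parameters shared via weight tying are counted once.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.3f}%)")

# The optimizer then only ever touches the embedding matrix.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
```

Restricting updates further, for example to the embedding rows of task-specific tokens only, could be done by masking the embedding weight's gradient with a hook; that refinement is an assumption about how the sub-0.1% budget might be implemented, not something stated in the abstract.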
📝 Abstract
Recent work has argued that large language models (LLMs) are not "abstract reasoners", citing their poor zero-shot performance on a variety of challenging tasks as evidence. We revisit these experiments in order to add nuance to the claim. First, we show that while LLMs indeed perform poorly in a zero-shot setting, even tuning a small subset of parameters for input encoding can enable near-perfect performance. However, we also show that this finetuning does not necessarily transfer across datasets. We take this collection of empirical results as an invitation to (re-)open the discussion of what it means to be an "abstract reasoner", and why it matters whether LLMs fit the bill.