🤖 AI Summary
This work investigates whether large language models (LLMs) possess genuine conceptual understanding of addition principles—or merely rely on superficial pattern memorization. Method: We systematically evaluate arithmetic generalization across three dimensions: (i) numeric addition within [0, 2⁶⁴], (ii) symbolic mapping generalization (e.g., 7→y), and (iii) commutativity consistency, using zero-shot inference, interpretability analysis (self-explanation vs. rule injection), and large-scale sampling. Contribution/Results: While numeric accuracy is high (73.8%–99.8%), symbolic generalization fails catastrophically (≤7.5%), and over 1,700 commutativity violations are observed. We introduce the first joint diagnostic—equivalent symbolic mapping plus commutativity consistency—to rigorously identify arithmetic generalization failure. Crucially, explicit injection of addition rules degrades performance by 81.2%, challenging the “prompt-as-reasoning” paradigm. These findings indicate that LLMs lack internalized, rule-based understanding of addition’s fundamental principles.
📝 Abstract
Despite high benchmark scores, Large Language Models (LLMs) often fail at simple problems, raising a critical question: do LLMs learn mathematical principles, or merely memorize patterns? Rather than designing increasingly complex benchmarks as recent works do, we investigate this using elementary two-integer addition ($0$ to $2^{64}$), probing two core properties: commutativity ($A+B=B+A$) and compositional generalization (via isomorphic symbolic mappings, e.g., $7 \rightarrow y$). While state-of-the-art LLMs achieve 73.8–99.8% accuracy on numerical addition, performance collapses to $\leq 7.5\%$ under symbolic mapping, indicating a failure to generalize learned rules. Non-monotonic performance scaling with digit count and frequent commutativity violations (over 1,700 cases of $A+B \neq B+A$) further support this. Explicitly providing addition rules degrades performance by 81.2% on average, while self-explanation maintains baseline accuracy, suggesting that LLM arithmetic processing is misaligned with human-defined principles. Our findings indicate that current LLMs rely on memorized patterns rather than genuine rule learning, highlighting architectural limitations and the need for new approaches to achieve true mathematical reasoning.
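The joint diagnostic described in the abstract — an isomorphic symbolic relabeling of digits plus a commutativity check on model answers — can be sketched roughly as follows. This is a minimal illustration under assumed conventions: the specific digit-to-letter mapping and the helper names (`encode`, `make_symbolic_probe`, `commutativity_violation`) are our own, not the paper's actual evaluation protocol.

```python
# Illustrative sketch (not the paper's code): relabel digits with letters so
# a model must apply the *rule* of addition rather than recall memorized
# digit strings, then compare its answers to A+B and B+A.

# Assumed mapping: 0->a, 1->b, ..., 9->j (the paper's own mapping may differ,
# e.g. it gives 7 -> y as an example).
DIGIT_TO_SYMBOL = dict(zip("0123456789", "abcdefghij"))

def encode(n: int) -> str:
    """Map each digit of n to its symbol (the isomorphic relabeling)."""
    return "".join(DIGIT_TO_SYMBOL[d] for d in str(n))

def make_symbolic_probe(a: int, b: int) -> tuple[str, str]:
    """Return (question, expected answer) for a symbolic-addition probe."""
    question = f"{encode(a)} + {encode(b)} = ?"
    expected = encode(a + b)
    return question, expected

def commutativity_violation(answer_ab: str, answer_ba: str) -> bool:
    """A model violates commutativity if its answers to A+B and B+A differ."""
    return answer_ab.strip() != answer_ba.strip()

# Example probe for 7 + 15 under the assumed mapping:
q, expected = make_symbolic_probe(7, 15)
print(q, "->", expected)  # h + bf = ? -> cc
```

Scoring a model then amounts to comparing its completion for `q` against `expected`, and flagging any pair where `commutativity_violation` is true.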