🤖 AI Summary
This work investigates whether large language models (LLMs) possess genuine conceptual understanding of addition principles—or merely rely on superficial pattern memorization. Method: We systematically evaluate arithmetic generalization across three dimensions: (i) numeric addition within [0, 2⁶⁴], (ii) symbolic mapping generalization (e.g., 7→y), and (iii) commutativity consistency, using zero-shot inference, interpretability analysis (self-explanation vs. rule injection), and large-scale sampling. Contribution/Results: While numeric accuracy is high (73.8%–99.8%), symbolic generalization fails catastrophically (≤7.5%), and over 1,700 commutativity violations are observed. We introduce the first joint diagnostic—equivalent symbolic mapping plus commutativity consistency—to rigorously identify arithmetic generalization failure. Crucially, explicit injection of addition rules degrades performance by 81.2%, challenging the “prompt-as-reasoning” paradigm. These findings indicate that LLMs lack internalized, rule-based understanding of addition’s fundamental principles.
📝 Abstract
Despite high benchmark scores, Large Language Models (LLMs) often fail at simple problems, raising a critical question: do LLMs learn mathematical principles, or merely memorize patterns? Rather than designing increasingly complex benchmarks as recent works do, we investigate this using elementary two-integer addition ($0$ to $2^{64}$), probing two core properties: commutativity ($A+B=B+A$) and compositional generalization (via isomorphic symbolic mappings, e.g., $7 \rightarrow y$). While state-of-the-art LLMs achieve 73.8–99.8% accuracy on numerical addition, performance collapses to $\leq 7.5\%$ under symbolic mapping, indicating a failure to generalize learned rules. Non-monotonic performance scaling with digit count and frequent commutativity violations (over 1,700 cases of $A+B \neq B+A$) further support this. Explicitly providing addition rules degrades performance by 81.2% on average, while self-explanation maintains baseline accuracy, suggesting that LLM arithmetic processing is misaligned with human-defined principles. Our findings indicate that current LLMs rely on memorized patterns rather than genuine rule learning, highlighting architectural limitations and the need for new approaches to achieve true mathematical reasoning.
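The joint diagnostic described in the abstract — an isomorphic symbolic relabeling of digits plus a commutativity check on model answers — can be sketched roughly as follows. This is a minimal illustration under assumed conventions: the specific digit-to-letter mapping and the helper names (`encode`, `make_symbolic_probe`, `commutativity_violation`) are our own, not the paper's actual evaluation protocol.

```python
# Illustrative sketch (not the paper's code): relabel digits with letters so
# a model must apply the *rule* of addition rather than recall memorized
# digit strings, then compare its answers to A+B and B+A.

# Assumed mapping: 0->a, 1->b, ..., 9->j (the paper's own mapping may differ,
# e.g. it gives 7 -> y as an example).
DIGIT_TO_SYMBOL = dict(zip("0123456789", "abcdefghij"))

def encode(n: int) -> str:
    """Map each digit of n to its symbol (the isomorphic relabeling)."""
    return "".join(DIGIT_TO_SYMBOL[d] for d in str(n))

def make_symbolic_probe(a: int, b: int) -> tuple[str, str]:
    """Return (question, expected answer) for a symbolic-addition probe."""
    question = f"{encode(a)} + {encode(b)} = ?"
    expected = encode(a + b)
    return question, expected

def commutativity_violation(answer_ab: str, answer_ba: str) -> bool:
    """A model violates commutativity if its answers to A+B and B+A differ."""
    return answer_ab.strip() != answer_ba.strip()

# Example probe for 7 + 15 under the assumed mapping:
q, expected = make_symbolic_probe(7, 15)
print(q, "->", expected)  # h + bf = ? -> cc
```

Scoring a model then amounts to comparing its completion for `q` against `expected`, and flagging any pair where `commutativity_violation` is true.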