🤖 AI Summary
Large language models (LLMs) exhibit a sharp performance decline in multi-operand addition, stemming from an inherent limitation in their autoregressive generation mechanism: reliance on a one-digit lookahead heuristic, which fails to capture the long-range dependencies required for modeling cascading carries.
Method: We diagnose mainstream LLMs on addition tasks involving three or more operands, using targeted probing experiments and per-digit accuracy analysis combined with a systematic evaluation across diverse tokenization strategies.
Results: All models exhibit carry error rates exceeding 80% for three-digit and larger additions, confirming the universality and structural nature of this bottleneck. This work identifies one-digit lookahead as a fundamental constraint on LLMs’ numerical reasoning capabilities, challenging the prevailing assumption that arithmetic deficits can be resolved solely through tokenization optimization. Our findings provide critical theoretical insight and empirical evidence for understanding the symbolic reasoning boundaries of large language models.
📝 Abstract
Autoregressive large language models (LLMs) exhibit impressive performance across various tasks but struggle with simple arithmetic, such as addition of two or more operands. We show that this struggle arises from LLMs' use of a simple one-digit lookahead heuristic, which works fairly well (but not perfectly) for two-operand addition but fails in multi-operand cases, where the carry-over logic is more complex. Our probing experiments and digit-wise accuracy evaluation show that LLMs fail precisely where a one-digit lookahead is insufficient to account for cascading carries. We analyze the impact of tokenization strategies on arithmetic performance and show that all investigated models, regardless of tokenization, are inherently limited in the addition of multiple operands due to their reliance on a one-digit lookahead heuristic. Our findings reveal fundamental limitations that prevent LLMs from generalizing to more complex numerical reasoning.
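To make the heuristic concrete, here is a minimal Python sketch of one possible reading of "one-digit lookahead". It emits the sum from the most significant digit down. For each output digit it guesses the incoming carry from the next lower column alone, ignoring any carry that flows into that column. The function names and the exact carry rule are assumptions made for illustration. They are not the paper's probing setup, and they are not a claim about what the models compute internally.

```python
def digit_columns(operands, width):
    """Return per-position digit sums, most significant position first."""
    padded = [str(x).zfill(width) for x in operands]
    return [sum(int(p[i]) for p in padded) for i in range(width)]


def lookahead_add(operands):
    """Hypothetical one-digit lookahead adder (illustrative assumption).

    Each output digit uses its own column plus a carry guessed from the
    next lower column only; carries cascading from further right are missed.
    """
    # Enough leading positions to hold the sum of len(operands) numbers.
    width = max(len(str(x)) for x in operands) + len(str(len(operands)))
    cols = digit_columns(operands, width)
    digits = []
    for i, column_sum in enumerate(cols):
        next_sum = cols[i + 1] if i + 1 < width else 0
        carry_guess = next_sum // 10  # ignores carries entering the next column
        digits.append((column_sum + carry_guess) % 10)
    return int("".join(map(str, digits)))


if __name__ == "__main__":
    for ops in ([123, 456], [145, 155], [186, 276, 349]):
        print(ops, "lookahead:", lookahead_add(ops), "true:", sum(ops))
```

Under these assumptions, 123 + 456 comes out correctly as 579. 145 + 155 yields 200 instead of 300, because the tens column sums to 9 and only rolls over once the carry from the units column arrives. 186 + 276 + 349 yields 711 instead of 811: the tens column sums to 19, so the lookahead guesses a carry of 1, while the true carry is 2 once the units column is included. With more operands, a column's carry can take more values, so a single column of lookahead determines it less often.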