🤖 AI Summary
This study investigates the sharp decline in accuracy of large language models on multi-digit integer addition as operand length increases. Through systematic analysis of three frontier models (Claude Opus 4.1, GPT-5, and Gemini 2.5 Pro), the work identifies and quantifies two dominant error types: operand misalignment and carry failure. Together, these two categories account for 87.9%, 62.9%, and 92.4% of all addition failures in the respective models. The study further shows that misalignment errors are frequently linked to tokenization, while carry failures largely manifest as independent random events. These findings offer insight into a fundamental limitation of large language models in basic arithmetic reasoning.
📝 Abstract
Modern AI systems have been successfully deployed to win medals at international math competitions, assist with research workflows, and prove novel technical lemmas. However, despite their progress at advanced levels of mathematics, they remain stubbornly bad at basic arithmetic, consistently failing on the simple task of adding two numbers. We present a systematic investigation of this phenomenon. We demonstrate empirically that all frontier models suffer significantly degraded accuracy for integer addition as the number of digits increases. Furthermore, we show that most errors made by these models are highly interpretable and can be attributed to either operand misalignment or a failure to correctly carry; these two error classes explain 87.9%, 62.9%, and 92.4% of Claude Opus 4.1, GPT-5, and Gemini 2.5 Pro errors, respectively. Finally, we show that misalignment errors are frequently related to tokenization, and that carrying errors appear largely as independent random failures.
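The two error classes described above can be made concrete with a small, hypothetical classifier: given two operands and a model's (possibly wrong) answer, it checks whether the answer matches an addition with one operand shifted by a power of ten (misalignment) or a digit-wise sum in which every carry is dropped (carry failure). The function names, the shift range, and the carry-free heuristic are illustrative assumptions for exposition, not the paper's actual attribution method.

```python
def add_without_carries(a: int, b: int) -> int:
    """Digit-wise sum where every carry is silently dropped."""
    result, place = 0, 1
    while a or b:
        result += ((a % 10 + b % 10) % 10) * place
        a, b, place = a // 10, b // 10, place * 10
    return result

def classify_error(a: int, b: int, answer: int) -> str:
    """Attribute a wrong answer to misalignment or carry failure (sketch)."""
    if answer == a + b:
        return "correct"
    # Misalignment: one operand shifted left/right by one or two digits
    # before adding (a simplified stand-in for column misalignment).
    for shift in (10, 100):
        if answer in (a * shift + b, a + b * shift,
                      a // shift + b, a + b // shift):
            return "misalignment"
    # Carry failure: answer equals the sum with all carries dropped,
    # e.g. 58 + 67 -> 15 instead of 125.
    if answer == add_without_carries(a, b):
        return "carry failure"
    return "other"
```

For example, `classify_error(58, 67, 15)` is labeled a carry failure, since 15 is exactly the digit-wise sum of 58 and 67 with both carries dropped. Real attribution would need to handle partial carry chains and multi-digit shifts, but the sketch captures the distinction the paper draws between the two classes.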