🤖 AI Summary
This paper identifies a systematic sign-omission flaw in large language models (LLMs) during subtraction: when the minuend is less than the subtrahend ($a < b$), models frequently output the correct absolute value but omit the negative sign, indicating that sign information is internally encoded but fails to be mapped correctly during token generation. Through a comprehensive evaluation spanning eight LLMs, paired arithmetic tasks (addition vs. subtraction), few-shot prompting, instruction fine-tuning, and probing analyses, we find that subtraction accuracy is consistently and significantly lower than addition accuracy, exposing a critical inconsistency in LLMs' reasoning and generation for non-commutative operations. Our key contributions are threefold: (1) we pinpoint the root cause of this failure to the generation stage specifically, rather than to representation or reasoning; (2) we demonstrate that lightweight instruction fine-tuning fully restores negative-sign generation (achieving ~99% accuracy); and (3) we show that few-shot prompting yields only marginal improvement, highlighting its limitations for correcting structural output biases.
📝 Abstract
We present a systematic study of subtraction in large language models (LLMs). While prior benchmarks emphasize addition and multiplication, subtraction has received comparatively little attention despite being structurally distinct as a non-commutative operation. We evaluate eight pretrained LLMs spanning four families on addition and subtraction problems. Our experiments reveal that subtraction accuracy lags behind addition by a wide margin. We find that errors for ($a-b$) are concentrated in cases where ($a<b$); in such cases, LLMs frequently produce the correct magnitude but omit the negative sign. Probing analyses show that LLMs internally encode whether results should be negative, yet this information is often not reflected in generated outputs. We further test well-known techniques, namely few-shot prompting and instruction fine-tuning, to see whether they can improve the LLMs' performance. Our results show that while few-shot prompting yields modest gains, instruction-tuned models achieve near-perfect accuracy in generating the negative sign. Together, these findings provide a clearer characterization of the limitations and recoverability of LLMs' arithmetic capabilities in subtraction.
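The failure mode described above (correct magnitude, missing sign) suggests a simple way to score model outputs beyond exact-match accuracy. The sketch below is a hypothetical classifier, not the paper's actual evaluation code: it parses a model's answer to $a - b$ and distinguishes fully correct answers from sign-omission errors, the case where the output equals $|a - b|$ even though the true result is negative.

```python
def classify_subtraction_output(a: int, b: int, output: str) -> str:
    """Classify a model's answer to a - b (illustrative helper, not from the paper).

    Returns one of: "correct", "sign_omission", "unparseable", "other_error".
    """
    true_result = a - b
    try:
        pred = int(output.strip())
    except ValueError:
        return "unparseable"
    if pred == true_result:
        return "correct"
    # The failure mode studied: right magnitude, missing negative sign.
    if true_result < 0 and pred == abs(true_result):
        return "sign_omission"
    return "other_error"


# Example: for 3 - 8 the true answer is -5; an output of "5" is a sign omission.
print(classify_subtraction_output(3, 8, "5"))   # → sign_omission
print(classify_subtraction_output(3, 8, "-5"))  # → correct
```

Tallying the `sign_omission` fraction over problems with $a < b$ would separate this structural output bias from ordinary magnitude errors, mirroring the error breakdown the abstract describes.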