🤖 AI Summary
This paper identifies a systematic sign-omission flaw in large language models (LLMs) during subtraction: when the minuend is less than the subtrahend ($a < b$), models frequently output the correct absolute value but omit the negative sign, indicating that sign information is internally encoded but fails to be mapped correctly during token generation. Through a comprehensive evaluation spanning eight LLMs, paired arithmetic tasks (addition vs. subtraction), few-shot prompting, instruction fine-tuning, and probing analyses, we find that subtraction accuracy is consistently and significantly lower than addition accuracy, exposing a critical inconsistency in LLMs' reasoning and generation for non-commutative operations. Our key contributions are threefold: (1) we pinpoint the root cause of this failure to the generation stage specifically, rather than to representation or reasoning; (2) we demonstrate that lightweight instruction fine-tuning fully restores negative-sign generation (achieving ~99% accuracy); and (3) we show that few-shot prompting yields only marginal improvement, highlighting its limitations for correcting structural output biases.
📝 Abstract
We present a systematic study of subtraction in large language models (LLMs). While prior benchmarks emphasize addition and multiplication, subtraction has received comparatively little attention despite being structurally distinct as a non-commutative operation. We evaluate eight pretrained LLMs spanning four families on addition and subtraction problems. Our experiments reveal that subtraction accuracy lags behind addition by a wide margin. We find that errors for ($a-b$) are concentrated in cases where ($a<b$); in such cases, LLMs frequently produce the correct magnitude but omit the negative sign. Probing analyses show that LLMs internally encode whether results should be negative, yet this information is often not reflected in generated outputs. We further test well-known techniques, namely few-shot prompting and instruction fine-tuning, to see whether they can improve the LLMs' performance. Our results show that while few-shot prompting yields modest gains, instruction-tuned models achieve near-perfect accuracy in generating the negative sign. Together, these findings provide a clearer characterization of the limitations and recoverability of LLMs' arithmetic capabilities in subtraction.
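The failure mode described above (correct magnitude, missing sign) suggests a simple way to score model outputs beyond exact-match accuracy. The sketch below is a hypothetical classifier, not the paper's actual evaluation code: it parses a model's answer to $a - b$ and distinguishes fully correct answers from sign-omission errors, the case where the output equals $|a - b|$ even though the true result is negative.

```python
def classify_subtraction_output(a: int, b: int, output: str) -> str:
    """Classify a model's answer to a - b (illustrative helper, not from the paper).

    Returns one of: "correct", "sign_omission", "unparseable", "other_error".
    """
    true_result = a - b
    try:
        pred = int(output.strip())
    except ValueError:
        return "unparseable"
    if pred == true_result:
        return "correct"
    # The failure mode studied: right magnitude, missing negative sign.
    if true_result < 0 and pred == abs(true_result):
        return "sign_omission"
    return "other_error"


# Example: for 3 - 8 the true answer is -5; an output of "5" is a sign omission.
print(classify_subtraction_output(3, 8, "5"))   # → sign_omission
print(classify_subtraction_output(3, 8, "-5"))  # → correct
```

Tallying the `sign_omission` fraction over problems with $a < b$ would separate this structural output bias from ordinary magnitude errors, mirroring the error breakdown the abstract describes.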