Number Cookbook: Number Understanding of Language Models and How to Improve It

📅 2024-11-06
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit systematic deficiencies in foundational Numerical Understanding and Processing Ability (NUPA)—e.g., erroneously judging “9.11 > 9.9”—despite their broad capabilities. Method: We introduce the first education-aligned NUPA benchmark, grounded in K–12 curricula, covering four numeric representations, 17 tasks, and 41 task-representation combinations. We conduct controlled experiments with small models to isolate effects of tokenization, positional encoding, and numeric formatting; perform supervised fine-tuning on large LLMs; and analyze chain-of-thought (CoT) prompting efficacy. Contribution/Results: State-of-the-art LLMs show weak performance across most basic numerical tasks. Fine-tuning substantially improves accuracy, yet conventional architectural enhancements—such as specialized tokenizers or positional encodings—prove ineffective in pretrained models. CoT yields marginal gains for elementary numerical judgments. We publicly release the benchmark, evaluation suite, and code to establish a measurable, improvable research paradigm for NUPA.

📝 Abstract
Large language models (LLMs) can solve an increasing number of complex reasoning tasks while making surprising mistakes in basic numerical understanding and processing (such as judging 9.11 > 9.9). This ability is essential for tackling complex arithmetic and mathematical problems and serves as a foundation for most reasoning tasks, yet previous work paid little attention to it or discussed only a few restricted tasks (like integer addition). In this paper, we comprehensively investigate the numerical understanding and processing ability (NUPA) of LLMs. First, we introduce a benchmark covering four common numerical representations and 17 distinct numerical tasks in four major categories, resulting in 41 meaningful combinations in total. These tasks are derived from primary and secondary education curricula, encompassing nearly all everyday numerical understanding and processing scenarios, and their rules are simple and clear. Through the benchmark, we find that current LLMs fail frequently on many of these tasks. To study the problem, we train small models with existing and potential techniques for enhancing NUPA (such as specialized tokenizers, positional encodings, and number formats), comprehensively evaluating their effectiveness on our testbed. We also finetune practical-scale LLMs on the proposed NUPA tasks and find that 1) naive finetuning substantially improves NUPA on many, but not all, tasks, and 2) surprisingly, techniques designed to enhance NUPA prove ineffective when finetuning pretrained models. We further explore the impact of chain-of-thought techniques on NUPA. Our work provides a more detailed and comprehensive understanding of NUPA in LLMs. Our benchmark and code are released at https://github.com/GraphPKU/number_cookbook.
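The "9.11 > 9.9" failure is an ordinary decimal comparison, which exact arithmetic resolves trivially. A minimal, hypothetical sketch of how such a comparison item might be generated and scored (this is not the paper's released benchmark code; `make_comparison_task` is an invented helper):

```python
import random
from decimal import Decimal

def make_comparison_task(rng: random.Random) -> tuple[str, str]:
    """Generate one decimal-comparison item: a question and its gold answer."""
    a = Decimal(rng.randint(0, 999)) / Decimal(100)  # e.g. 9.11
    b = Decimal(rng.randint(0, 999)) / Decimal(100)  # e.g. 9.9
    question = f"Which is larger, {a} or {b}?"
    answer = "equal" if a == b else str(max(a, b))
    return question, answer

# The classic failure case: exact decimal arithmetic gets it right.
assert Decimal("9.11") < Decimal("9.9")
```

Using `Decimal` rather than `float` keeps the gold answers exact, so scoring a model's output reduces to a string comparison against the generated answer.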
Problem

Research questions and friction points this paper is trying to address.

Investigates numerical understanding and processing in large language models.
Introduces a benchmark for evaluating numerical tasks in LLMs.
Explores techniques to improve numerical abilities in pretrained models.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed a benchmark for numerical understanding tasks.
Trained small models with NUPA enhancement techniques.
Explored the impact of chain-of-thought on numerical processing.