Regress, Don't Guess - A Regression-like Loss on Number Tokens for Language Models

📅 2024-11-04
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Language models exhibit low numerical accuracy in arithmetic reasoning tasks, primarily because cross-entropy loss treats digit tokens as unordered categorical symbols, ignoring their inherent numerical proximity and ordinal structure. Method: We propose two regression-oriented training objectives—Lp loss and Wasserstein-1 distance—that explicitly model numeric tokens as ordered, continuous quantities rather than discrete symbols. These objectives are integrated into the T5 architecture and coupled with a probability-weighted numerical decoding strategy. Contribution/Results: Experiments on mathematical reasoning benchmarks demonstrate substantial improvements in numerical generation accuracy and significant reductions in absolute error—e.g., mean absolute error (MAE) decreases by up to 42%. Our approach overcomes the fundamental limitation of classification-based losses in capturing quantitative relationships, offering a scalable, plug-and-play paradigm to enhance language models’ quantitative reasoning capabilities.

📝 Abstract
While language models have exceptional capabilities at text generation, they lack a natural inductive bias for emitting numbers and thus struggle in tasks involving reasoning over quantities, especially arithmetic. This has particular relevance in scientific datasets where combinations of text and numerical data are abundant. One fundamental limitation is the nature of the cross-entropy (CE) loss, which assumes a nominal (categorical) scale and thus cannot convey proximity between generated number tokens. As a remedy, here we present two versions of a number token loss. The first is based on an $L_p$ loss between the ground truth token value and the weighted sum of the predicted class probabilities. The second loss minimizes the Wasserstein-1 distance between the distribution of the predicted output probabilities and the ground truth distribution. These regression-like losses can easily be added to any language model and extend the CE objective during training. We compare the proposed schemes on a mathematics dataset against existing tokenization, encoding, and decoding schemes for improving number representation in language models. Our results reveal a significant improvement in numerical accuracy when equipping a standard T5 model with the proposed loss schemes.
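The $L_p$ variant described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' released code: the function name `number_token_loss_lp` and the `token_values` lookup table (mapping each vocabulary id to its numeric value, `NaN` for non-number tokens) are assumptions made for the sketch.

```python
import torch
import torch.nn.functional as F

def number_token_loss_lp(logits, target_ids, token_values, p=1):
    """
    Regression-like loss on number tokens (illustrative sketch).

    logits:       (batch, seq, vocab) raw model outputs
    target_ids:   (batch, seq) ground-truth token ids
    token_values: (vocab,) numeric value of each token; NaN for non-number tokens
    """
    probs = F.softmax(logits, dim=-1)                       # (B, S, V)
    number_mask = ~torch.isnan(token_values)                # which vocab entries are numbers
    # Expected numeric value under the predicted distribution,
    # with non-number tokens contributing zero.
    values = torch.where(number_mask, token_values, torch.zeros_like(token_values))
    expected = (probs * values).sum(dim=-1)                 # (B, S)

    target_vals = token_values[target_ids]                  # (B, S); NaN where target is not a number
    pos_mask = ~torch.isnan(target_vals)                    # penalize only number positions
    if pos_mask.sum() == 0:
        return logits.new_zeros(())
    diff = (expected[pos_mask] - target_vals[pos_mask]).abs() ** p
    return diff.mean()
```

In training this term would be added to the standard CE objective; because it compares the *expected* token value with the true value, predicting "4" when the target is "5" is penalized less than predicting "9".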
Problem

Research questions and friction points this paper is trying to address.

Language models lack an inductive bias for number generation
Cross-entropy loss fails to convey proximity between number tokens
Proposing a number token loss to improve performance on math tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Regression-like loss on number tokens, added to the CE objective
Minimizes an $L_p$ norm or the Wasserstein-1 distance
Integrates seamlessly into any language model
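The Wasserstein-1 variant exploits that number tokens are ordered: for one-dimensional distributions on an evenly spaced grid (e.g. the digits 0-9), the Wasserstein-1 distance reduces to the L1 distance between the two cumulative distribution functions. A minimal sketch, assuming single-digit tokenization and a hypothetical `wasserstein1_digit_loss` helper:

```python
import torch
import torch.nn.functional as F

def wasserstein1_digit_loss(digit_logits, target_digits):
    """
    Wasserstein-1 loss between the predicted distribution over the
    ordered digit tokens 0..9 and the one-hot ground truth
    (illustrative sketch, not the authors' code).
    """
    probs = F.softmax(digit_logits, dim=-1)                  # (N, 10)
    target = F.one_hot(target_digits, num_classes=10).float()
    # For 1-D distributions with unit spacing, W1 = sum of |CDF differences|.
    cdf_diff = probs.cumsum(-1) - target.cumsum(-1)          # (N, 10)
    return cdf_diff.abs().sum(-1).mean()
```

Unlike cross-entropy, which penalizes all wrong digits equally, this loss grows with the numeric distance between prediction and target: a confident "4" against a true "5" costs about 1, while a confident "9" costs about 4.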
👥 Authors
Jonas Zausinger - TU Munich, Germany; TUM.AI, Germany
Lars Pennig - TU Munich, Germany; TUM.AI, Germany
Kacper Chlodny - TU Munich, Germany; TUM.AI, Germany
Vincent Limbach - TU Munich, Germany; TUM.AI, Germany
Anna Ketteler - TU Munich, Germany; TUM.AI, Germany
Thorben Prein - Technische Universität München
Vishwa Mohan Singh - LMU Munich, Germany
M. M. Danziger - IBM Research Europe, Switzerland
Jannis Born - IBM Research