Regress, Don't Guess - A Regression-like Loss on Number Tokens for Language Models

📅 2024-11-04
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Language models exhibit low numerical accuracy in arithmetic reasoning tasks, primarily because cross-entropy loss treats digit tokens as unordered categorical symbols, ignoring their inherent numerical proximity and ordinal structure. Method: We propose two regression-oriented training objectives—Lp loss and Wasserstein-1 distance—that explicitly model numeric tokens as ordered, continuous quantities rather than discrete symbols. These objectives are integrated into the T5 architecture and coupled with a probability-weighted numerical decoding strategy. Contribution/Results: Experiments on mathematical reasoning benchmarks demonstrate substantial improvements in numerical generation accuracy and significant reductions in absolute error—e.g., mean absolute error (MAE) decreases by up to 42%. Our approach overcomes the fundamental limitation of classification-based losses in capturing quantitative relationships, offering a scalable, plug-and-play paradigm to enhance language models’ quantitative reasoning capabilities.

📝 Abstract
While language models have exceptional capabilities at text generation, they lack a natural inductive bias for emitting numbers and thus struggle in tasks involving reasoning over quantities, especially arithmetic. This has particular relevance in scientific datasets where combinations of text and numerical data are abundant. One fundamental limitation is the nature of the cross-entropy (CE) loss, which assumes a nominal (categorical) scale and thus cannot convey proximity between generated number tokens. As a remedy, here we present two versions of a number token loss. The first is based on an $L_p$ loss between the ground truth token value and the weighted sum of the predicted class probabilities. The second loss minimizes the Wasserstein-1 distance between the distribution of the predicted output probabilities and the ground truth distribution. These regression-like losses can easily be added to any language model and extend the CE objective during training. We compare the proposed schemes on a mathematics dataset against existing tokenization, encoding, and decoding schemes for improving number representation in language models. Our results reveal a significant improvement in numerical accuracy when equipping a standard T5 model with the proposed loss schemes.
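The $L_p$ variant described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' released code: the function name `number_token_loss_lp` and the `token_values` lookup table (mapping each vocabulary id to its numeric value, `NaN` for non-number tokens) are assumptions made for the sketch.

```python
import torch
import torch.nn.functional as F

def number_token_loss_lp(logits, target_ids, token_values, p=1):
    """
    Regression-like loss on number tokens (illustrative sketch).

    logits:       (batch, seq, vocab) raw model outputs
    target_ids:   (batch, seq) ground-truth token ids
    token_values: (vocab,) numeric value of each token; NaN for non-number tokens
    """
    probs = F.softmax(logits, dim=-1)                       # (B, S, V)
    number_mask = ~torch.isnan(token_values)                # which vocab entries are numbers
    # Expected numeric value under the predicted distribution,
    # with non-number tokens contributing zero.
    values = torch.where(number_mask, token_values, torch.zeros_like(token_values))
    expected = (probs * values).sum(dim=-1)                 # (B, S)

    target_vals = token_values[target_ids]                  # (B, S); NaN where target is not a number
    pos_mask = ~torch.isnan(target_vals)                    # penalize only number positions
    if pos_mask.sum() == 0:
        return logits.new_zeros(())
    diff = (expected[pos_mask] - target_vals[pos_mask]).abs() ** p
    return diff.mean()
```

In training this term would be added to the standard CE objective; because it compares the *expected* token value with the true value, predicting "4" when the target is "5" is penalized less than predicting "9".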
Problem

Research questions and friction points this paper is trying to address.

Language models lack an inductive bias for number generation
Cross-entropy loss fails to convey proximity between number tokens
Proposing a number token loss to improve performance on math tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Regression-like loss on number tokens, added to the CE objective
Minimizes an $L_p$ norm or the Wasserstein-1 distance
Integrates seamlessly into any language model
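The Wasserstein-1 variant exploits that number tokens are ordered: for one-dimensional distributions on an evenly spaced grid (e.g. the digits 0-9), the Wasserstein-1 distance reduces to the L1 distance between the two cumulative distribution functions. A minimal sketch, assuming single-digit tokenization and a hypothetical `wasserstein1_digit_loss` helper:

```python
import torch
import torch.nn.functional as F

def wasserstein1_digit_loss(digit_logits, target_digits):
    """
    Wasserstein-1 loss between the predicted distribution over the
    ordered digit tokens 0..9 and the one-hot ground truth
    (illustrative sketch, not the authors' code).
    """
    probs = F.softmax(digit_logits, dim=-1)                  # (N, 10)
    target = F.one_hot(target_digits, num_classes=10).float()
    # For 1-D distributions with unit spacing, W1 = sum of |CDF differences|.
    cdf_diff = probs.cumsum(-1) - target.cumsum(-1)          # (N, 10)
    return cdf_diff.abs().sum(-1).mean()
```

Unlike cross-entropy, which penalizes all wrong digits equally, this loss grows with the numeric distance between prediction and target: a confident "4" against a true "5" costs about 1, while a confident "9" costs about 4.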
👥 Authors
Jonas Zausinger - TU Munich, Germany; TUM.AI, Germany
Lars Pennig - TU Munich, Germany; TUM.AI, Germany
Kacper Chlodny - TU Munich, Germany; TUM.AI, Germany
Vincent Limbach - TU Munich, Germany; TUM.AI, Germany
Anna Ketteler - TU Munich, Germany; TUM.AI, Germany
Thorben Prein - Technische Universität München
Vishwa Mohan Singh - LMU Munich, Germany
M. M. Danziger - IBM Research Europe, Switzerland
Jannis Born - IBM Research