🤖 AI Summary
This work addresses a limitation of large language models (LLMs) in number prediction: standard maximum likelihood estimation (MLE) is not tailored to numbers, while recent penalty-driven losses that add a numerical-distance bias yield digit distributions that are either overly peaked or excessively flat. The authors propose Digit Entropy Loss (DEL), which reformulates entropy optimization as a supervised learning objective and drops the dependence on numerical distance metrics. DEL uses digit-wise conditional probabilities and binary cross-entropy in an autoregressive setting, and it covers integer digits, decimal digits, and the decimal point within a single objective, so integers and floating-point numbers are handled consistently. Evaluated on seven mathematical reasoning benchmarks with four LLMs (CodeLlama, Mistral, DeepSeek, and Qwen-2.5), DEL consistently outperforms existing methods in both overall prediction accuracy and numerical distance.
📝 Abstract
Number prediction is a fundamental capability of large language models (LLMs) in mathematical problem-solving and code generation. The widely adopted maximum likelihood estimation (MLE) objective for LLM training is not tailored to number prediction. Recent penalty-driven approaches, e.g., Number Token Loss and Discretized Distance Loss, introduce an inductive bias of numerical distance but induce over-sharpened and over-flattened digit distributions, respectively. In this paper, we conduct an in-depth analysis of LLM numerical learning and show that existing numerical learning methods conceptually follow a criterion-distance formulation, where the criterion term represents the optimization pattern and the distance term instills a geometric prior. Building on this analysis, we present Digit Entropy Loss (DEL) for auto-regressive numerical learning, which reformulates conventional unsupervised entropy optimization through three key designs: leveraging digit conditional probability and binary cross-entropy to carry out entropy optimization in a supervised manner; discarding the distance term to bypass the issue of numerical distance; and generalizing integer-based numerical learning to floating-point number optimization, enabling more accurate number prediction. The DEL formulation can incorporate integers, decimals, and decimal points, expanding the learning objective from a single digit to the floating-point number domain. Experiments on seven mathematical reasoning benchmarks with four representative LLMs (CodeLlama, Mistral, DeepSeek, and Qwen-2.5) demonstrate that DEL consistently outperforms its counterparts in both overall prediction accuracy and numerical distance. Source code is available at https://github.com/PolyU-VCLab/DEL
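The abstract does not state the DEL formula, so the sketch below does not implement it. It only illustrates, in plain Python, three ideas the abstract relies on: serializing a number into one symbol per position (digits plus the decimal point), scoring it autoregressively as a sum of per-position conditional log-probabilities, and measuring how peaked or flat a per-position distribution is through its Shannon entropy. The 11-symbol vocabulary and the function names `serialize`, `entropy`, and `sequence_log_prob` are assumptions made for this illustration and are not taken from the paper or its repository.

```python
import math
from typing import List, Sequence

# Hypothetical vocabulary for illustration: ten digits plus the decimal point.
VOCAB = list("0123456789.")


def serialize(number_text: str) -> List[str]:
    """Split a decimal string such as "3.14" into one symbol per position."""
    symbols = list(number_text)
    for s in symbols:
        if s not in VOCAB:
            raise ValueError(f"unsupported symbol: {s!r}")
    return symbols


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one per-position distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sequence_log_prob(symbols: Sequence[str],
                      step_probs: Sequence[Sequence[float]]) -> float:
    """Autoregressive score: sum over positions of log p(symbol_t | prefix).

    step_probs[t] is assumed to be the model's distribution over VOCAB
    at position t, already conditioned on the earlier symbols.
    """
    return sum(math.log(dist[VOCAB.index(s)])
               for s, dist in zip(symbols, step_probs))
```

A one-hot distribution over `VOCAB` has entropy 0 (the fully sharpened extreme), while a uniform one has entropy `math.log(len(VOCAB))` (the fully flattened extreme). The abstract's over-sharpened and over-flattened digit distributions lie toward these two ends.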