AI Summary
This study addresses the significant degradation in numerical reasoning performance of large language models when they are confronted with numeral systems or formats rarely seen in their training data. We systematically evaluate mainstream large language models across diverse numeral representations and, for the first time, reveal the critical impact of numeric format on model capabilities. To mitigate this limitation, we propose a targeted strategy that combines few-shot prompting with explicit numeral mapping, effectively enhancing the models' cross-format generalization. Experimental results demonstrate that our approach substantially narrows the performance gap observed under non-standard numeral formats, offering a novel pathway toward improving numerical robustness in large language models.
Abstract
Large language models (LLMs) have achieved impressive proficiency in basic arithmetic, rivaling human-level performance on standard numerical tasks. However, little attention has been given to how these models perform when numerical expressions deviate from the prevailing conventions present in their training corpora. In this work, we investigate numerical reasoning across a wide range of numeral scripts and formats. We show that LLM accuracy drops substantially when numerical inputs are rendered in underrepresented scripts or formats, despite the underlying mathematical reasoning being identical. We further demonstrate that targeted prompting strategies, such as few-shot prompting and explicit numeral mapping, can greatly narrow this gap. Our findings highlight an overlooked challenge in multilingual numerical reasoning and provide actionable insights for working with LLMs to reliably interpret, manipulate, and generate numbers across diverse numeral scripts and formatting styles.
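The "explicit numeral mapping" idea can be illustrated with a small preprocessing step that converts digits from underrepresented scripts to ASCII before a prompt reaches the model. The sketch below is not from the paper; the script choices (Eastern Arabic and Devanagari) and function names are illustrative assumptions.

```python
# Illustrative sketch (assumed, not the paper's implementation):
# normalize numerals from non-ASCII scripts to Western Arabic digits
# before querying an LLM.

EASTERN_ARABIC = "٠١٢٣٤٥٦٧٨٩"  # U+0660 .. U+0669
DEVANAGARI = "०१२३४५६७८९"       # U+0966 .. U+096F

def build_digit_map(*scripts: str) -> dict[int, str]:
    """Map each non-ASCII digit codepoint to its ASCII equivalent.

    Each script string lists its digits in value order 0..9, so the
    index of a character is also its numeric value.
    """
    table: dict[int, str] = {}
    for digits in scripts:
        for value, ch in enumerate(digits):
            table[ord(ch)] = str(value)
    return table

DIGIT_MAP = build_digit_map(EASTERN_ARABIC, DEVANAGARI)

def normalize_numerals(text: str) -> str:
    """Rewrite any mapped digit to its ASCII form; other text is untouched."""
    return text.translate(DIGIT_MAP)

print(normalize_numerals("٤٢ + ७ = ?"))  # → "42 + 7 = ?"
```

Pairing such a mapping with a few in-context examples of the target script, as the paper's prompting strategy suggests, gives the model both the familiar digit forms and demonstrations of the conversion.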