🤖 AI Summary
This study addresses the challenge of numerical instability in scientific software caused by floating-point precision errors, particularly in safety-critical contexts where traditional methods struggle with complex expressions. It presents the first systematic evaluation of large language models (LLMs) for improving numerical stability by detecting and rewriting unstable arithmetic expressions. The experiments encompass 2,470 expressions featuring nested conditionals, high-precision literals, and multi-variable arithmetic, evaluated across six prominent LLMs. Results demonstrate that LLMs outperform baseline methods in 65.4% of cases and successfully stabilize 97.9% of the 431 instances where baselines completely fail. Nevertheless, limitations persist in handling control flow constructs and high-precision literals, highlighting areas for future improvement.
📝 Abstract
Scientific software relies on high-precision computation, yet finite floating-point representations can introduce precision errors that propagate through downstream computations, with serious consequences in safety-critical domains. Despite the growing use of large language models (LLMs) in scientific applications, their reliability in handling floating-point numerical stability has not been systematically evaluated. This paper evaluates LLMs' reasoning about high-precision numerical computation through two numerical stabilization tasks: (1) detecting instability in numerical expressions by generating error-inducing inputs (detection), and (2) rewriting expressions to improve numerical stability (stabilization). Using popular numerical benchmarks, we assess six LLMs on 2,470 numerical expressions, including nested conditionals, high-precision literals, and multi-variable arithmetic.
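To make the detection task concrete, here is a minimal sketch (our own illustration, not code from the paper): evaluate a candidate expression in double precision and against a high-precision reference, and flag inputs where the relative error explodes. The expression `sqrt(x+1) - sqrt(x)`, which suffers catastrophic cancellation for large `x`, is an assumed example chosen for illustration.

```python
from decimal import Decimal, getcontext
import math

getcontext().prec = 50  # 50-digit decimal arithmetic as the reference

def double_eval(x: float) -> float:
    # Naive double-precision evaluation of sqrt(x+1) - sqrt(x).
    return math.sqrt(x + 1.0) - math.sqrt(x)

def reference_eval(x: float) -> Decimal:
    # Same expression evaluated with 50 significant digits.
    d = Decimal(x)  # exact conversion from the binary double
    return (d + 1).sqrt() - d.sqrt()

def relative_error(x: float) -> float:
    ref = reference_eval(x)
    if ref == 0:
        return 0.0
    return float(abs((Decimal(double_eval(x)) - ref) / ref))

# Sweep candidate inputs; large x is an error-inducing input here,
# because sqrt(x+1) and sqrt(x) agree in almost every digit.
for x in [1.0, 1e8, 1e16]:
    print(f"x = {x:g}: relative error = {relative_error(x):.2e}")
```

At `x = 1e16` the double-precision result loses all significant digits (relative error 1), while at `x = 1.0` it is accurate to machine precision; a detector reports the former as an error-inducing input.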
Our results show that LLMs match state-of-the-art traditional approaches in detecting and stabilizing numerically unstable computations. More notably, LLMs succeed precisely where the baselines fail: of the 17.4% (431) of expressions where the baseline does not improve accuracy, LLMs successfully stabilize 422 (97.9%), and they achieve greater stability than the baseline on 65.4% (1,615) of all expressions. However, LLMs struggle with control flow and high-precision literals, consistently removing such structures rather than reasoning about their numerical implications, whereas they perform substantially better on purely symbolic expressions. Together, these findings suggest that LLMs can stabilize expressions that classical techniques cannot, yet falter when exact numerical magnitudes and control-flow semantics must be reasoned about precisely, as such concrete patterns are rarely encountered during training.
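As an illustration of the stabilization task, consider the classic cancellation-prone expression `sqrt(x+1) - sqrt(x)` (an assumed example for exposition, not drawn from the paper's benchmark). A stable rewrite multiplies by the conjugate, replacing the destructive subtraction with an addition:

```python
import math

def naive_diff(x: float) -> float:
    # Unstable: for large x, sqrt(x+1) and sqrt(x) agree in nearly all
    # digits, so the subtraction cancels catastrophically.
    return math.sqrt(x + 1.0) - math.sqrt(x)

def stable_diff(x: float) -> float:
    # Algebraically equivalent rewrite via the conjugate:
    # sqrt(x+1) - sqrt(x) = 1 / (sqrt(x+1) + sqrt(x)).
    return 1.0 / (math.sqrt(x + 1.0) + math.sqrt(x))

print(naive_diff(1e16))   # 0.0 — every significant digit is lost
print(stable_diff(1e16))  # ~5e-9, close to the true value
```

Rewriting tools of this kind search for algebraically equivalent but numerically safer forms automatically; the study asks whether LLMs can discover such rewrites as well.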