Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning

📅 2025-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs), particularly reasoning-oriented ones, exhibit severe output irreproducibility across hardware configurations (GPU model, GPU count, batch size) and low-precision formats (e.g., bfloat16), primarily because floating-point arithmetic is non-associative: rounding differences in early tokens cascade into divergent chains of thought. This work is the first to systematically identify numerical precision as a critical bottleneck for reproducible LLM inference, observing up to 9% accuracy fluctuation and response-length differences of up to 9,000 tokens on DeepSeek-R1-Distill-Qwen-7B. To address this, we propose LayerCast, a lightweight inference pipeline that stores weights in 16-bit precision but performs all computations in FP32, preserving memory efficiency while substantially improving numerical stability. We further introduce a cross-precision, cross-hardware reproducibility evaluation. Experiments demonstrate that LayerCast effectively suppresses output drift, achieving high output consistency alongside efficient inference.
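The non-associativity at the heart of the summary is easy to reproduce in isolation. The sketch below uses Python's stdlib `struct` half-precision codec (`'e'`) as a stand-in for bfloat16; the helper name `f16` and the specific constants are illustrative, not from the paper:

```python
import struct

def f16(x: float) -> float:
    """Round a Python float to IEEE 754 half precision (bfloat16 stand-in)."""
    return struct.unpack('e', struct.pack('e', x))[0]

a, b, c = 2048.0, 1.0, 1.0

# (a + b) first: 2049 is halfway between the representable values 2048 and
# 2050, so ties-to-even rounds it back to 2048; adding c repeats the loss.
left = f16(f16(a + b) + c)

# (b + c) first: 2 is exact, and 2048 + 2 = 2050 is exactly representable.
right = f16(a + f16(b + c))

assert left == 2048.0 and right == 2050.0  # same inputs, different results
```

Different hardware configurations effectively pick different parenthesizations of the same reduction, which is why batch size and GPU count can change the generated tokens.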

📝 Abstract
Large Language Models (LLMs) are now integral across various domains and have demonstrated impressive performance. Progress, however, rests on the premise that benchmark scores are both accurate and reproducible. We demonstrate that the reproducibility of LLM performance is fragile: changing system configuration such as evaluation batch size, GPU count, and GPU version can introduce significant differences in the generated responses. This issue is especially pronounced in reasoning models, where minor rounding differences in early tokens can cascade into divergent chains of thought, ultimately affecting accuracy. For instance, under bfloat16 precision with greedy decoding, a reasoning model like DeepSeek-R1-Distill-Qwen-7B can exhibit up to 9% variation in accuracy and a 9,000-token difference in response length due to differences in GPU count, type, and evaluation batch size. We trace the root cause of this variability to the non-associative nature of floating-point arithmetic under limited numerical precision. This work presents the first systematic investigation into how numerical precision affects reproducibility in LLM inference. Through carefully controlled experiments across various hardware, software, and precision settings, we quantify when and how model outputs diverge. Our analysis reveals that floating-point precision -- while critical for reproducibility -- is often neglected in evaluation practices. Inspired by this, we develop a lightweight inference pipeline, dubbed LayerCast, that stores weights in 16-bit precision but performs all computations in FP32, balancing memory efficiency with numerical stability. Code is available at https://github.com/nanomaoli/llm_reproducibility.
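The abstract's claim, that reduction order changes low-precision results while higher-precision accumulation restores agreement, can be sketched in a few lines. This is a toy illustration, not the paper's pipeline; it uses stdlib `struct` half precision in place of bfloat16, and Python's 64-bit floats in place of FP32 accumulation:

```python
import struct

def f16(x: float) -> float:
    """Round a Python float to IEEE 754 half precision."""
    return struct.unpack('e', struct.pack('e', x))[0]

vals = [2048.0] + [1.0] * 10  # every value is exactly representable in half

def reduce_sum(xs, round_step=f16):
    acc = 0.0
    for v in xs:
        acc = round_step(acc + v)   # rounding applied after every partial sum
    return acc

# Half-precision accumulation: the reduction order changes the answer, just
# as batch size or GPU count reorders reductions on real hardware.
assert reduce_sum(vals) == 2048.0        # big value first: the 1s are lost
assert reduce_sum(vals[::-1]) == 2058.0  # small values first: they survive

# LayerCast-style fix: keep 16-bit storage but accumulate at higher
# precision -- both orders now agree.
full = lambda x: x
assert reduce_sum(vals, full) == reduce_sum(vals[::-1], full) == 2058.0
```

The two half-precision sums disagree by 10 despite identical inputs, which is the mechanism behind the accuracy and response-length swings the abstract reports.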
Problem

Research questions and friction points this paper is trying to address.

Benchmark reproducibility is fragile: evaluation batch size, GPU count, and GPU model all change generated responses
In reasoning models, minor early-token rounding differences cascade into divergent chains of thought
Floating-point precision is critical for reproducibility yet routinely neglected in evaluation practice
Innovation

Methods, ideas, or system contributions that make the work stand out.

LayerCast: a lightweight inference pipeline for numerically stable decoding
Stores weights in 16-bit precision but upcasts to FP32 for all computations
First systematic study of how numerical precision affects LLM inference reproducibility
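The storage/compute split in the bullets above can be sketched as a toy linear layer. This is a hypothetical illustration of the idea only: `LayerCastLinear` and `to_f16` are invented names, stdlib half precision stands in for bfloat16 storage, and Python's 64-bit floats stand in for FP32 compute:

```python
import struct

def to_f16(x: float) -> float:
    """Round to IEEE 754 half precision -- a stand-in for bfloat16 storage."""
    return struct.unpack('e', struct.pack('e', x))[0]

class LayerCastLinear:
    """Toy sketch of the LayerCast idea: weights live in 16-bit storage,
    but every computation runs at full precision."""

    def __init__(self, weight):
        # Quantize weight storage to 16 bits (the memory saving)
        self.weight = [[to_f16(w) for w in row] for row in weight]

    def __call__(self, x):
        # Python floats are 64-bit, so these dot products accumulate at
        # full precision; the paper's pipeline upcasts to FP32 instead.
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.weight]

layer = LayerCastLinear([[0.1, 0.2], [0.3, 0.4]])
y = layer([1.0, 1.0])
assert len(y) == 2
```

Because the accumulation precision is fixed regardless of how the reduction is parallelized, the output no longer depends on batch size or GPU count, while weight memory stays at 16 bits per parameter.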