Beyond Reproducibility: Token Probabilities Expose Large Language Model Nondeterminism

📅 2026-01-03
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the non-deterministic outputs of large language models on GPUs, which persist even when deterministic modes are enabled due to variations in floating-point operation ordering. The work presents the first systematic analysis of token-level probability distribution shifts, revealing that probabilities in the 0.1–0.9 range are significantly affected by non-determinism, whereas extreme probabilities near 0 or 1 remain relatively stable. Through floating-point precision analysis under GPU execution environments and comparative experiments across multiple models, the authors propose a novel method that requires only a single inference pass to assess the impact of non-determinism. This approach offers both a new perspective and a practical tool for quantifying and understanding non-deterministic behavior in large model inference.
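The finding that mid-range probabilities are most affected is consistent with basic softmax sensitivity: the derivative of a softmax probability with respect to its logit is p(1 − p), which peaks at p = 0.5 and vanishes near 0 and 1. The sketch below (our own illustration, not code from the paper) perturbs a logit by a tiny epsilon, standing in for a last-bit rounding difference, and shows the probability shift shrinking as p approaches the extremes.

```python
# Illustration (not from the paper): a tiny logit perturbation shifts a
# softmax probability roughly in proportion to p * (1 - p), so mid-range
# probabilities move the most and extreme ones barely move.
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

eps = 1e-3  # stand-in for a floating-point rounding difference
for logit in (6.0, 2.0, 0.5, 0.0):  # two-token vocabulary for simplicity
    p = softmax([logit, 0.0])[0]
    p_pert = softmax([logit + eps, 0.0])[0]
    shift = abs(p_pert - p)
    # shift is approximately eps * p * (1 - p): largest near p = 0.5
    print(f"p = {p:.3f}   shift = {shift:.2e}")
```

This is why, at nonzero temperature, the same rounding noise that is invisible for near-certain tokens can flip sampling outcomes for tokens in the 0.1–0.9 range.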

📝 Abstract
The execution of Large Language Models (LLMs) has been shown to produce nondeterministic results when run on Graphics Processing Units (GPUs), even when they are configured to produce deterministic results. This is due to the finite precision effects of the arithmetic operations, which depend on the order in which they are executed. This order, in turn, depends on the processes that are running concurrently on the GPU. Previous studies have focused on the impact of nondeterminism on the text generated by the LLMs or on proposing mechanisms to achieve deterministic execution. This work takes a closer look at nondeterminism by analyzing the variations in the token probabilities, not in the generated text. Interestingly, all the models evaluated have similar results in both the trends and the actual values of the variations of the probabilities. In particular, the results show that the effects of nondeterminism are significant for token probabilities in the range of 0.1 to 0.9, while they are much smaller when the probabilities are close to 0 or 1. This has significant implications for our understanding of nondeterminism. The first is that nondeterminism will likely have a non-negligible impact on generated text when the temperature is not zero, as it introduces significant variations in the token probabilities except when they are close to 0 or 1. The second is that all models have similar nondeterministic variations at the token probability level; therefore, differences in the performance of the generated text, for example when measuring accuracy on a benchmark, seem to come from different token probabilities or response lengths. A third implication is that we may be able to estimate the impact of nondeterminism by running a single inference and analyzing the token-level probabilities, instead of having to run the same inference many times.
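The root cause named in the abstract, finite-precision arithmetic whose result depends on execution order, can be reproduced on a CPU in a few lines. The sketch below (our own illustration, not code from the paper) sums the same numbers in two different orders; because floating-point addition is not associative, the reduction order changes the low-order bits, just as concurrent GPU reductions do.

```python
# Illustration (not from the paper): floating-point addition is not
# associative, so the order in which partial sums are reduced can
# change the last few bits of the result.
import random

random.seed(0)
values = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

forward = sum(values)             # one reduction order
backward = sum(reversed(values))  # the same numbers, reversed order

# The two orders agree to many decimal places but can differ in the
# low-order bits, like GPU reductions scheduled under concurrent load.
print(f"forward    = {forward!r}")
print(f"backward   = {backward!r}")
print(f"difference = {abs(forward - backward):.3e}")
```

On a GPU, the reduction order inside matrix multiplications and attention kernels varies with concurrent workloads, which is why deterministic-mode flags alone do not eliminate the effect the paper measures.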
Problem

Research questions and friction points this paper is trying to address.

nondeterminism
large language models
token probabilities
GPU execution
finite precision arithmetic
Innovation

Methods, ideas, or system contributions that make the work stand out.

nondeterminism
token probabilities
large language models
GPU arithmetic
reproducibility
🔎 Similar Papers
2024-08-16 · Conference on Empirical Methods in Natural Language Processing · Citations: 6
Tairan Fu
Politecnico di Milano, Milano, Italy
Gonzalo Martínez
Universidad Carlos III de Madrid
Javier Conde
Information Processing and Telecommunications Center, ETSI de Telecomunicación, Universidad Politécnica de Madrid, Madrid, Spain
Carlos Arriaga
Information Processing and Telecommunications Center, ETSI de Telecomunicación, Universidad Politécnica de Madrid, Madrid, Spain
Pedro Reviriego
Information Processing and Telecommunications Center, ETSI de Telecomunicación, Universidad Politécnica de Madrid, Madrid, Spain
Xiuyuan Qi
University of Electronic Science and Technology of China, Chengdu, China
Shanshan Liu
Professor, University of Electronic Science and Technology of China
Fault tolerant electronics · emerging computing · dependable machine learning · error correction codes