The Silent Hyperparameter: Quantifying the Impact of Inference Backends on LLM Reproducibility

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the lack of reproducibility in large language model (LLM) evaluations on standard benchmarks, which often stems from unreported differences in inference backends. The authors systematically evaluate output consistency and performance across five widely used inference engines, including vLLM, SGLang, and llama.cpp, while holding model weights, decoding parameters, and hardware fixed. Their analysis shows that the choice of inference backend acts as a "silent hyperparameter" that significantly influences benchmark outcomes: backend selection alone can shift scores by up to 16.6 percentage points and cause substantial output divergence. The work advocates mandatory disclosure of the inference stack in experimental reporting to improve the reproducibility and fairness of LLM evaluations.
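To make the disclosure recommendation concrete, here is a minimal sketch of what a machine-readable inference-stack record might look like. The class name `InferenceStackReport`, its field names, and the example values are all hypothetical and chosen for illustration; they are not a format proposed by the paper.

```python
import json
import platform
import sys
from dataclasses import asdict, dataclass, field


@dataclass
class InferenceStackReport:
    """Hypothetical record of inference settings to publish with benchmark scores."""

    backend: str            # engine name, e.g. "vLLM", "SGLang", "llama.cpp"
    backend_version: str    # exact release or commit hash of the engine
    model_id: str           # model name or checkpoint identifier
    dtype: str              # numeric precision used at inference time
    # Decoding parameters (temperature, top_p, seed, max tokens, ...).
    decoding: dict = field(default_factory=dict)
    # Engine-level optimizations that were enabled or disabled.
    optimizations: dict = field(default_factory=dict)
    hardware: str = "unspecified"
    # Environment details are captured automatically when the record is created.
    python_version: str = field(default_factory=lambda: sys.version.split()[0])
    os: str = field(default_factory=platform.platform)

    def to_json(self) -> str:
        """Serialize the record with stable key order so reports diff cleanly."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Placeholder values only; none of these come from the paper.
report = InferenceStackReport(
    backend="example-backend",
    backend_version="0.0.0",
    model_id="example-org/example-model",
    dtype="bfloat16",
    decoding={"temperature": 0.0, "top_p": 1.0, "seed": 0},
    optimizations={"prefix_caching": False, "cuda_graphs": False},
    hardware="1x example GPU",
)
print(report.to_json())
```

Publishing a record like this alongside each reported score would let readers see at a glance whether two results were produced under the same inference stack.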

📝 Abstract

Progress in LLMs is increasingly measured through standardized benchmarks, where state-of-the-art improvements are often separated by fractions of a percentage point. At the same time, the computational cost of evaluating modern LLMs has driven widespread adoption of specialized inference backends, software systems that execute trained models efficiently at inference time. While critical for scalability, system-level optimizations, such as custom CUDA kernels and reduced-precision arithmetic, can alter token probabilities and introduce non-determinism, possibly cascading into divergent generation. In this work, we first survey the inference landscape, identifying 200 distinct engines, and analyze 35,000 ML publications, finding that the specific inference stack is rarely reported despite this widespread diversity. We then present a systematic empirical study of how inference backends affect LLM benchmark results. Holding model weights, decoding parameters, and hardware constant, we evaluate five widely used inference engines, including vLLM, SGLang, and llama.cpp, across multiple open-weight models and established benchmarks. We show that the choice of backend alone can shift benchmark scores by up to 16.6 percentage points and induce high rates of output disagreement. By isolating backend optimizations and tracing the execution pipeline, we find this divergence is driven by system-level optimizations like prefix caching and CUDA graphs, custom kernels, and engine-specific defaults in logit processing. Our findings identify the inference backend as a previously unreported but consequential hyperparameter in the evaluation of LLMs and advocate standardized reporting of inference stacks to improve the reproducibility and interpretability of benchmark comparisons.
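The two headline quantities in the abstract, score shift and output disagreement, can be illustrated with a small sketch. The abstract does not define either metric, so this sketch assumes that disagreement is the fraction of prompts whose outputs differ by exact string match and that score shift is the gap between the best and worst backend. The function names, backend labels, and toy data are all made up for illustration and are not the paper's results.

```python
from itertools import combinations


def disagreement_rate(outputs_a, outputs_b):
    """Fraction of prompts on which two backends produced different outputs."""
    if len(outputs_a) != len(outputs_b):
        raise ValueError("both backends must be run on the same prompts")
    if not outputs_a:
        return 0.0
    differing = sum(a != b for a, b in zip(outputs_a, outputs_b))
    return differing / len(outputs_a)


def pairwise_disagreement(outputs_by_backend):
    """Disagreement rate for every unordered pair of backends."""
    return {
        (x, y): disagreement_rate(outputs_by_backend[x], outputs_by_backend[y])
        for x, y in combinations(sorted(outputs_by_backend), 2)
    }


def score_spread(scores_by_backend):
    """Gap in percentage points between the best and worst backend score."""
    values = list(scores_by_backend.values())
    return max(values) - min(values)


# Toy data: same model and prompts, three hypothetical backends.
outputs = {
    "backend_a": ["A", "B", "C", "D"],
    "backend_b": ["A", "B", "C", "A"],
    "backend_c": ["A", "C", "C", "A"],
}
scores = {"backend_a": 71.0, "backend_b": 68.5, "backend_c": 70.0}

for pair, rate in pairwise_disagreement(outputs).items():
    print(pair, f"{rate:.2f}")
print("score spread (pp):", score_spread(scores))
```

Exact string match is the strictest possible comparison. A study could instead compare extracted final answers or token sequences, which would give lower disagreement rates for the same generations.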

Problem

Research questions and friction points this paper is trying to address.

inference backend

LLM reproducibility

benchmarking

non-determinism

system-level optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

inference backend

LLM reproducibility

system-level optimization