LAMP: Look-Ahead Mixed-Precision Inference of Large Language Models

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of significantly improving numerical precision in large language model inference with minimal computational overhead. The authors propose a proactive mixed-precision inference method that, for the first time, leverages rounding error analysis of composite functions to guide precision allocation. By dynamically identifying critical components and selectively applying high-precision computation only where necessary, the approach integrates fine-grained precision scheduling with floating-point error modeling to enable adaptive mixed-precision inference within the Transformer architecture. Experimental results on GPT-2 demonstrate that deploying high-precision recomputation on only a small fraction of operations yields up to two orders of magnitude (100×) improvement in inference accuracy.
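To make the workflow concrete, here is a minimal NumPy sketch of the general idea (our construction, not the paper's code): $f$ is taken to be logsumexp, $g$ a linear map, and the components of $g(\mathrm{x})$ are ranked by a first-order sensitivity heuristic before a small subset is recomputed in float64. The choice of $f$, $g$, the ranking rule, and all names are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch of adaptive mixed-precision evaluation of a
# composition f(g(x)), with f = logsumexp and g(x) = W @ x.
# Components of g(x) to which f is most sensitive are recomputed in
# float64; everything else stays in float16.

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))
x = rng.standard_normal(256)

def logsumexp(y):
    m = y.max()
    return m + np.log(np.exp(y - m).sum())

# 1) Cheap low-precision pass for g(x).
y_lo = (W.astype(np.float16) @ x.astype(np.float16)).astype(np.float64)

# 2) First-order error estimate |df/dy_i| * u * |y_i|, where u is the
#    float16 unit roundoff; the gradient of logsumexp is the softmax of y.
u16 = 2.0 ** -11
sens = np.exp(y_lo - y_lo.max())
sens /= sens.sum()
err_est = sens * u16 * np.abs(y_lo)

# 3) Recompute only the top-k most error-critical components in float64.
k = 8  # a few percent of the 256 components
idx = np.argsort(err_est)[-k:]
y_mixed = y_lo.copy()
y_mixed[idx] = W[idx] @ x

# 4) Evaluate f on the corrected vector and compare with full float64.
exact = logsumexp(W @ x)
print("low-precision error:  ", abs(logsumexp(y_lo) - exact))
print("mixed-precision error:", abs(logsumexp(y_mixed) - exact))
```

The point of the sketch is the selection rule in steps 2 and 3, not the specific numbers: only the error-critical components pay the cost of high-precision recomputation.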

📝 Abstract
Mixed-precision computations are a hallmark of the current stage of AI, driving the progress in large language models towards efficient, locally deployable solutions. This article addresses the floating-point computation of compositionally rich functions, concentrating on transformer inference. Based on the rounding error analysis of a composition $f(g(\mathrm{x}))$, we provide an adaptive strategy that selects a small subset of components of $g(\mathrm{x})$ to be computed more accurately, while all other computations are carried out at lower accuracy. We then explain how this strategy can be applied to different compositions within a transformer and illustrate its overall effect on transformer inference. We study the effectiveness of this algorithm numerically on GPT-2 models and demonstrate that even very low recomputation rates yield accuracy improvements of up to two orders of magnitude.
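The selection strategy follows from a first-order rounding-error model; the derivation below is our reconstruction of the general argument, not the paper's exact bound. If each component of $g$ is computed in a format with unit roundoff $u_i$, so that $\hat{g}_i = g_i(\mathrm{x})(1+\delta_i)$ with $|\delta_i| \le u_i$, then linearizing $f$ gives

```latex
f(\hat{g}) - f(g(\mathrm{x}))
  \approx \sum_i \frac{\partial f}{\partial g_i}\, g_i(\mathrm{x})\, \delta_i,
\qquad
\bigl| f(\hat{g}) - f(g(\mathrm{x})) \bigr|
  \lesssim \sum_i \left| \frac{\partial f}{\partial g_i} \right| |g_i(\mathrm{x})|\, u_i .
```

Under a fixed budget of high-precision recomputations, this bound is smallest when the small $u_i$ are assigned to the few components with the largest $|\partial f / \partial g_i|\,|g_i(\mathrm{x})|$, which is why recomputing only a small subset can recover most of the accuracy.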
Problem

Research questions and friction points this paper is trying to address.

mixed-precision inference
large language models
rounding error
transformer
floating-point computation
Innovation

Methods, ideas, or system contributions that make the work stand out.

mixed-precision inference
rounding error analysis
adaptive precision allocation
transformer optimization
efficient LLM inference
👥 Authors
Stanislav S. Budzinskiy
Faculty of Mathematics, University of Vienna, Vienna, Austria
Marian Gloser
Faculty of Mathematics, University of Vienna, Vienna, Austria
Tolunay Yilmaz
Faculty of Mathematics, University of Vienna, Vienna, Austria
Ying Hong Tham
Huawei Technologies
Yuanyi Lin
Huawei Technologies
Wenyi Fang
Huawei Technologies
Fan Wu
Huawei Technologies
Philipp Petersen
University of Vienna
Applied Harmonic Analysis · Differential equations · Neural network approximation