🤖 AI Summary
This work addresses the challenge of significantly improving numerical precision in large language model inference with minimal computational overhead. The authors propose a proactive mixed-precision inference method that, for the first time, leverages rounding error analysis of composite functions to guide precision allocation. By dynamically identifying critical components and selectively applying high-precision computation only where necessary, the approach integrates fine-grained precision scheduling with floating-point error modeling to enable adaptive mixed-precision inference within the Transformer architecture. Experimental results on GPT-2 demonstrate that deploying high-precision recomputation on only a small fraction of operations yields up to two orders of magnitude (100×) improvement in inference accuracy.
📝 Abstract
Mixed-precision computations are a hallmark of the current stage of AI, driving progress in large language models toward efficient, locally deployable solutions. This article addresses the floating-point computation of compositionally rich functions, concentrating on transformer inference. Based on the rounding error analysis of a composition $f(g(\mathrm{x}))$, we provide an adaptive strategy that selects a small subset of components of $g(\mathrm{x})$ to be computed more accurately, while all other computations can be carried out at lower accuracy. We then explain how this strategy can be applied to different compositions within a transformer and illustrate its overall effect on transformer inference. We study the effectiveness of this algorithm numerically on GPT-2 models and demonstrate that even very low recomputation rates yield improvements of up to two orders of magnitude in accuracy.
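
The selective-recomputation idea for a composition $f(g(\mathrm{x}))$ can be sketched as follows. This is a minimal toy illustration, not the authors' algorithm: it assumes float32 as the low-precision format and float64 as the high-precision one, picks $f = \sum_i y_i$ and $g = \exp$ as stand-in functions, and ranks components by a simple first-order error proxy $|\partial f/\partial y_i| \cdot |y_i| \cdot \varepsilon$.

```python
import numpy as np

def mixed_precision_fg(x, recompute_frac=0.05):
    """Toy sketch of adaptive mixed-precision evaluation of f(g(x)).

    Here g = exp and f = sum (illustrative choices, not from the paper).
    All of g is first computed in low precision (float32); the components
    whose estimated rounding-error contribution to f is largest are then
    recomputed in high precision (float64) before f is evaluated.
    """
    x = np.asarray(x, dtype=np.float64)
    y_lo = np.exp(x.astype(np.float32))           # low-precision g(x)
    # First-order error proxy per component: |df/dy_i| * |y_i| * eps.
    # For f = sum, |df/dy_i| = 1, so it reduces to |y_i| * eps32.
    err_proxy = np.abs(y_lo.astype(np.float64)) * np.finfo(np.float32).eps
    k = max(1, int(recompute_frac * x.size))
    idx = np.argsort(err_proxy)[-k:]              # worst-offending components
    y = y_lo.astype(np.float64)
    y[idx] = np.exp(x[idx])                       # selective high-precision recompute
    return y.sum()

rng = np.random.default_rng(0)
x = rng.uniform(-8.0, 8.0, size=1000)
mixed = mixed_precision_fg(x, recompute_frac=0.05)
reference = np.exp(x).sum()                       # full high-precision result
```

Because the error proxy concentrates on the few large-magnitude components that dominate both the sum and its rounding error, recomputing only ~5% of the components already recovers most of the accuracy of a full high-precision evaluation.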