🤖 AI Summary
This work addresses a gap in existing theoretical analyses of neural network expressivity, which typically assume exact real-number computation and overlook the practical effects of finite floating-point precision, arbitrary reduction orders, and inexact implementations of activation functions. The paper introduces a general distinguishability framework and shows that the ability to distinguish every pair of distinct inputs in the first layer is necessary for universal representability, meaning the ability to represent arbitrary functions between floating-point domains exactly. It further shows that a suitable form of distinguishability is also sufficient under mild conditions on the activation implementation. These results hold under a floating-point semantics that allows arbitrary reduction orders and activation implementations with bounded unit-in-the-last-place (ulp) errors. Using this framework, the study shows that implementations of widely used activation functions, including Sigmoid, tanh, ReLU, ELU, SeLU, GeLU, Swish, Mish, and sin, remain universally representable under these more practical computational constraints, extending prior results that relied on fixed left-to-right reduction orders and correctly rounded activations.
📝 Abstract
Most existing expressivity theories for neural networks assume exact real arithmetic, whereas practical neural networks are executed under finite-precision floating-point arithmetic with implementation-dependent execution semantics. Recent works have begun studying the expressive power of floating-point neural networks, but existing results are limited to highly restricted activation functions and idealized assumptions such as fixed left-to-right reduction orders and correctly rounded activation implementations. In this work, we study the expressive power of floating-point neural networks under generalized floating-point execution semantics, including arbitrary reduction orders and inexact activation implementations with bounded ulp errors. We investigate when floating-point neural networks can represent arbitrary functions between floating-point domains exactly. To this end, we introduce a general distinguishability framework and show that the ability to distinguish every pair of distinct inputs in the first layer is necessary for universal representability. This characterization yields broad classes of activation implementations that are not universal representators, extending previous isolated counterexamples such as the correctly rounded cosine activation. We further prove that a suitable form of distinguishability is also sufficient for universal representability under mild conditions on the activation implementation. Using this framework, we establish universal representability results for a broad class of practical activation functions, including implementations of $\mathrm{Sigmoid}$, $\tanh$, $\mathrm{ReLU}$, $\mathrm{ELU}$, $\mathrm{SeLU}$, $\mathrm{GeLU}$, $\mathrm{Swish}$, $\mathrm{Mish}$, and $\sin$, under significantly more realistic floating-point execution models than previously known.
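To illustrate why the reduction order matters in such an execution model, the following minimal Python sketch sums the same three double-precision values in every sequential order and collects the distinct results. It is not taken from the paper: the helper names `sum_in_order` and `distinct_sums` and the chosen input values are assumptions made for illustration, and the sketch covers only sequential accumulation orders, not tree-shaped reductions.

```python
from itertools import permutations


def sum_in_order(values, order):
    """Accumulate values sequentially in the given index order.

    Each addition is rounded to the nearest double, so the result
    can depend on the order in which the terms are visited.
    """
    total = 0.0
    for i in order:
        total += values[i]
    return total


def distinct_sums(values):
    """Return the set of results over all sequential summation orders."""
    indices = range(len(values))
    return {sum_in_order(values, p) for p in permutations(indices)}


# Hypothetical inputs: 1.0 is far below the spacing of doubles near 2**60,
# so it is absorbed whenever it is added while the large term is present.
values = [2.0**60, 1.0, -(2.0**60)]

if __name__ == "__main__":
    print(sum_in_order(values, (0, 1, 2)))  # large + small first, then cancel
    print(sum_in_order(values, (0, 2, 1)))  # cancel first, then add small
    print(sorted(distinct_sums(values)))
```

In exact real arithmetic every order would give 1; in double precision the result is 0.0 or 1.0 depending on whether the large terms cancel before or after 1.0 is added. A network whose dot products are evaluated in an unspecified order therefore has to be analyzed for all admissible orders, not only left to right.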