🤖 AI Summary
This paper investigates the minimum width required for feedforward neural networks with squashable activation functions to achieve universal approximation of functions from $[0,1]^{d_x}$ to $\mathbb{R}^{d_y}$ in the $L^p$ sense. Squashability, defined as the ability of an activation function to approximate both the identity function and the binary step function arbitrarily well via compositions with affine maps, unifies two fundamental approximation capabilities. The authors establish that, for all non-affine analytic activation functions and a broad class of piecewise activation functions, the minimum width is exactly $\max\{d_x, d_y, 2\}$; this bound also holds when $d_x = d_y = 1$, provided the activation function is monotone. The work generalizes ReLU-type minimum-width results to a wide family of nonlinear activations, introduces the notion of *squashability*, provides verifiable sufficient conditions for it, and thereby extends both the applicability and the structural understanding of universal approximation theory.
📝 Abstract
The exact minimum width that allows for universal approximation by networks of unbounded depth is known only for ReLU and its variants. In this work, we study the minimum width of networks using general activation functions. Specifically, we focus on squashable functions: those that can approximate the identity function and the binary step function by alternately composing them with affine transformations. We show that for networks using a squashable activation function to universally approximate $L^p$ functions from $[0,1]^{d_x}$ to $\mathbb{R}^{d_y}$, the minimum width is $\max\{d_x,d_y,2\}$ unless $d_x=d_y=1$; the same bound holds for $d_x=d_y=1$ if the activation function is monotone. We then provide sufficient conditions for squashability and show that all non-affine analytic functions and a class of piecewise functions are squashable, i.e., our minimum width result holds for these general classes of activation functions.
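To make the squashability notion concrete, here is a minimal numerical sketch, assuming the sigmoid activation $\sigma(x) = 1/(1+e^{-x})$ (an illustrative choice, not the paper's formal construction, and the scaling parameters `eps` and `scale` are arbitrary): composing $\sigma$ with affine maps yields arbitrarily good approximations of the identity on a bounded interval and of the binary step function.

```python
# Illustrative sketch of "squashability" for the sigmoid activation
# sigma(x) = 1 / (1 + exp(-x)); not the paper's construction.
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-1.0, 1.0, 201)

# (i) Approximate the identity: shrink the input so sigma acts near 0,
#     where it is nearly linear, then undo the scaling with another affine map.
#     Since sigma'(0) = 1/4, we have (sigma(eps * x) - 1/2) / (eps / 4) ~ x for small eps.
eps = 1e-3
identity_approx = (sigma(eps * x) - 0.5) / (eps * 0.25)
print("max |identity error|:", np.max(np.abs(identity_approx - x)))  # ~1e-7

# (ii) Approximate the binary step 1{x >= 0}: blow up the input so the
#      sigmoid's transition becomes arbitrarily sharp.
scale = 1e4
step_approx = sigma(scale * x)
step_true = (x >= 0).astype(float)
# The error is small outside a shrinking neighborhood of 0, which suffices in the L^p sense.
mask = np.abs(x) > 0.01
print("max |step error| away from 0:", np.max(np.abs(step_approx - step_true)[mask]))
```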