🤖 AI Summary
This work investigates the adaptive approximation and learning capabilities of constant-depth neural networks with smooth activation functions for high-order smooth target functions. By constructing an explicit network architecture and analyzing approximation and estimation errors in Sobolev spaces, the authors show that the smoothness of the activation function alone enables constant-depth networks to adapt automatically to arbitrary levels of target smoothness, achieving minimax-optimal rates (up to logarithmic factors) for both the approximation and the statistical estimation error. The study further shows that smooth activations can serve as a viable alternative to increasing network depth, overcoming the approximation-order bottleneck that ReLU-type networks face because of their non-smoothness. Notably, the result requires no sparsity assumptions and simultaneously guarantees controlled parameter norms and statistical learnability.
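For context, neither the summary nor the abstract spells out the rates in question; the standard benchmarks for the Sobolev class $W^{s,\infty}([0,1]^d)$ named below take the following form (background conventions, not notation taken from the paper), where $\mathcal{F}_W$ denotes networks with at most $W$ parameters, $n$ is the sample size, and $\hat f_n$ is an ERM estimator:

$$
\inf_{f_\theta \in \mathcal{F}_W} \|f - f_\theta\|_{L^\infty([0,1]^d)} \;\lesssim\; W^{-s/d},
\qquad
\mathbb{E}\,\|\hat f_n - f\|_{L^2([0,1]^d)}^2 \;\asymp\; n^{-\frac{2s}{2s+d}},
$$

both up to logarithmic factors. Per the abstract, the paper's point is that constant depth with a smooth activation suffices to reach both benchmarks for every $s>0$, whereas ReLU networks need depth that grows with the smoothness order.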
📝 Abstract
Smooth activation functions are ubiquitous in modern deep learning, yet their theoretical advantages over non-smooth counterparts remain poorly understood. In this work, we characterize both approximation and statistical properties of neural networks with smooth activations over the Sobolev space $W^{s,\infty}([0,1]^d)$ for arbitrary smoothness $s>0$. We prove that constant-depth networks equipped with smooth activations automatically exploit arbitrarily high orders of target function smoothness, achieving the minimax-optimal approximation and estimation error rates (up to logarithmic factors). In sharp contrast, networks with non-smooth activations, such as ReLU, lack this adaptivity: their attainable approximation order is strictly limited by depth, and capturing higher-order smoothness requires proportional depth growth. These results identify activation smoothness as a fundamental mechanism, alternative to depth, for attaining statistical optimality. Technically, our results are established via a constructive approximation framework that produces explicit neural network approximators with carefully controlled parameter norms and model size. This complexity control ensures statistical learnability under empirical risk minimization (ERM) and removes the impractical sparsity constraints commonly required in prior analyses.
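To make the setting concrete, the following is a minimal sketch (assuming PyTorch) of the regime the abstract describes: a fixed-depth network with a smooth activation, fit by empirical risk minimization to noisy samples of a smooth target on $[0,1]^d$. This is not the paper's construction; the width, the two hidden layers, the placeholder target function, and the training budget are illustrative choices only.

```python
# Illustrative sketch (not the paper's construction): constant-depth MLP with a
# smooth (C^infty) activation, trained by ERM with squared loss on [0,1]^d.
import torch
import torch.nn as nn

torch.manual_seed(0)

d, width, n_train = 2, 64, 2048

# Placeholder smooth target on [0,1]^d; any sufficiently smooth function would do.
def f_star(x):
    return torch.sin(2 * torch.pi * x[:, 0:1]) * torch.exp(-x[:, 1:2])

# Depth stays fixed; only the width is treated as a capacity knob.
model = nn.Sequential(
    nn.Linear(d, width), nn.Tanh(),
    nn.Linear(width, width), nn.Tanh(),
    nn.Linear(width, 1),
)

x = torch.rand(n_train, d)
y = f_star(x) + 0.05 * torch.randn(n_train, 1)  # noisy regression samples

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(2000):
    opt.zero_grad()
    loss = ((model(x) - y) ** 2).mean()  # empirical risk (squared loss)
    loss.backward()
    opt.step()

# Held-out error on fresh points as a proxy for the estimation error.
x_test = torch.rand(4096, d)
with torch.no_grad():
    test_mse = ((model(x_test) - f_star(x_test)) ** 2).mean()
print(f"train loss {loss.item():.4f}, test MSE {test_mse.item():.4f}")
```

Swapping `nn.Tanh()` for `nn.ReLU()` in this sketch reproduces, in spirit, the contrast the abstract draws: with a non-smooth activation, the attainable approximation order at fixed depth is limited, while the smooth-activation network is the one the paper proves can adapt to arbitrary target smoothness.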