Universal Smoothness via Bernstein Polynomials: A Constructive Approximation Approach for Activation Functions

📅 2026-05-04
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses a longstanding trade-off between optimization stability and computational efficiency in activation functions. Piecewise linear functions such as ReLU are cheap to evaluate but are non-differentiable at the origin, which can destabilize training; smooth alternatives avoid the kink but rely on transcendental operations that add computational overhead. To resolve this, the authors propose BerLU (Bernstein Linear Unit), an activation function that uses Bernstein polynomials to build a differentiable quadratic transition region around the origin. The design removes the singularity while keeping the piecewise linear structure elsewhere, yielding a continuously differentiable function with a Lipschitz constant of one. The construction is grounded in constructive approximation theory, which the authors use to obtain theoretical guarantees alongside practical efficiency. Experiments reportedly show that BerLU outperforms established activations across Vision Transformer and CNN architectures, with higher accuracy on ImageNet and other image classification benchmarks and lower computational and memory cost.
📝 Abstract
The efficacy of deep neural networks is heavily reliant on the design of non-linear activation functions, yet existing approaches often struggle to balance optimization stability with computational efficiency. While piecewise linear functions offer inference speed, they suffer from optimization instability due to non-differentiability at the origin, whereas smooth counterparts typically incur significant computational overhead through their reliance on transcendental operations. To address these limitations, this paper proposes a general smoothing framework based on constructive approximation theory and introduces the Bernstein Linear Unit (BerLU). This novel activation function utilizes Bernstein polynomials to construct a differentiable quadratic transition region that effectively eliminates singularities while maintaining a piecewise linear structure. Theoretical analysis demonstrates that the proposed method guarantees strictly continuous differentiability and a non-expansive Lipschitz constant of one, which ensures stable gradient propagation and prevents the gradient explosion problems common in deep architectures. Comprehensive empirical evaluations across representative Vision Transformer and Convolutional Neural Network architectures confirm that this approach consistently outperforms state-of-the-art baselines on standard image classification benchmarks while delivering superior computational and memory efficiency.
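The abstract does not state BerLU's closed form, so the following is only a minimal sketch of the general idea it describes: replacing the kink of a piecewise linear unit with a quadratic Bernstein polynomial on a small interval. Everything specific here is an assumption made for illustration, not the authors' definition: the base function (ReLU), the symmetric transition interval [-eps, eps], the parameter name `eps`, and the function names. The Bernstein coefficients (0, 0, eps) are chosen so that both the value and the slope match the linear pieces at the two endpoints, which gives a continuously differentiable function whose slope stays in [0, 1].

```python
def bernstein_quadratic(t: float, b0: float, b1: float, b2: float) -> float:
    """Degree-2 Bernstein polynomial with coefficients b0, b1, b2, for t in [0, 1]."""
    return b0 * (1.0 - t) ** 2 + 2.0 * b1 * t * (1.0 - t) + b2 * t ** 2


def smoothed_relu(x: float, eps: float = 1.0) -> float:
    """ReLU with a quadratic Bernstein transition on [-eps, eps] (illustrative only)."""
    if x <= -eps:
        return 0.0          # left linear piece: constant zero
    if x >= eps:
        return x            # right linear piece: identity
    t = (x + eps) / (2.0 * eps)  # map [-eps, eps] onto [0, 1]
    # Coefficients (0, 0, eps): value 0 and slope 0 at x = -eps,
    # value eps and slope 1 at x = +eps.
    return bernstein_quadratic(t, 0.0, 0.0, eps)


def smoothed_relu_grad(x: float, eps: float = 1.0) -> float:
    """Derivative of smoothed_relu; continuous and bounded in [0, 1]."""
    if x <= -eps:
        return 0.0
    if x >= eps:
        return 1.0
    # d/dx [eps * t^2] with t = (x + eps) / (2 * eps) simplifies to t.
    return (x + eps) / (2.0 * eps)
```

With `eps = 1.0`, `smoothed_relu(0.0)` returns 0.25 and `smoothed_relu_grad(0.0)` returns 0.5, while inputs outside [-1, 1] behave exactly like ReLU. Because the slope never exceeds 1 in magnitude, this sketch is 1-Lipschitz, and evaluating it needs only comparisons, additions, and multiplications rather than transcendental functions.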
Problem

Research questions and friction points this paper is trying to address.

activation functions
optimization stability
computational efficiency
non-differentiability
smoothness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bernstein polynomials
activation function
smooth approximation
Lipschitz continuity
constructive approximation