λ-GELU: Learning Gating Hardness for Controlled ReLU-ization in Deep Networks

📅 2026-03-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the tension between the training benefits of smooth activation functions and the deployment compatibility of ReLU. The authors propose λ-GELU, a GELU variant with a learnable hardness (sharpness) parameter λ that provides a controlled path from smooth training toward ReLU-compatible inference. Because naive updates of λ are unstable and suffer from gradient attenuation, the method relies on a constrained reparameterization and an optimizer-aware update scheme for λ. Across MLPs, CNNs, and Transformers, the learned λ values form structured layerwise hardness profiles. A deterministic hardening strategy then allows post-training replacement of λ-GELU with ReLU with reduced disruption, linking smooth training to ReLU-centric deployment pipelines.

📝 Abstract

The Gaussian Error Linear Unit (GELU) is a widely used smooth alternative to the Rectified Linear Unit (ReLU), yet many deployment, compression, and analysis toolchains are most naturally expressed for piecewise-linear (ReLU-type) networks. We study a hardness-parameterized formulation of GELU, f(x; λ) = x Φ(λx), where Φ is the standard Gaussian CDF and λ ∈ [1, ∞) controls gate sharpness, with the goal of turning smooth gated training into a controlled path toward ReLU-compatible models. Learning λ is non-trivial: naive updates yield unstable dynamics and effective gradient attenuation, so we introduce a constrained reparameterization and an optimizer-aware update scheme. Empirically, across a diverse set of model-dataset pairs spanning MLPs, CNNs, and Transformers, we observe structured layerwise hardness profiles and assess their robustness under different initializations. We further study a deterministic ReLU-ization strategy in which the learned gates are progressively hardened toward a principled target, enabling a post-training substitution of λ-GELU by ReLU with reduced disruption. Overall, λ-GELU provides a minimal and interpretable knob to profile and control gating hardness, bridging smooth training with ReLU-centric downstream pipelines.
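
The formula in the abstract can be illustrated with a minimal Python sketch. It evaluates f(x; λ) = x Φ(λx) and measures how far it is from ReLU on a fixed grid as λ grows. The abstract does not specify the constrained reparameterization, so `hardness_from_raw` below (λ = 1 + softplus(θ)) is only one hypothetical way to keep λ in [1, ∞) and is not the authors' scheme; the grid and the λ values are likewise illustrative assumptions.

```python
import math


def gaussian_cdf(z: float) -> float:
    """Standard normal CDF Phi(z), computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def lambda_gelu(x: float, lam: float) -> float:
    """f(x; lambda) = x * Phi(lambda * x); lambda = 1 recovers standard GELU."""
    if lam < 1.0:
        raise ValueError("lambda must lie in [1, inf)")
    return x * gaussian_cdf(lam * x)


def relu(x: float) -> float:
    return max(0.0, x)


def hardness_from_raw(theta: float) -> float:
    """ASSUMPTION (not from the paper): map an unconstrained theta to
    lambda = 1 + softplus(theta), which always satisfies lambda >= 1."""
    # Numerically stable softplus: max(theta, 0) + log(1 + exp(-|theta|)).
    softplus = max(theta, 0.0) + math.log1p(math.exp(-abs(theta)))
    return 1.0 + softplus


def max_gap_to_relu(lam: float, grid: list[float]) -> float:
    """Largest absolute difference between lambda-GELU and ReLU on the grid."""
    return max(abs(lambda_gelu(x, lam) - relu(x)) for x in grid)


# Illustrative grid on [-5, 5] with step 0.01.
grid = [i / 100.0 for i in range(-500, 501)]
gaps = [max_gap_to_relu(lam, grid) for lam in (1.0, 4.0, 16.0, 64.0)]

if __name__ == "__main__":
    for lam, gap in zip((1.0, 4.0, 16.0, 64.0), gaps):
        print(f"lambda={lam:5.1f}  max |f - ReLU| on grid = {gap:.4f}")
```

Substituting u = λx shows that the pointwise gap to ReLU equals |u|(1 − Φ(|u|)) / λ, so it shrinks in proportion to 1/λ. This is consistent with the abstract's idea of hardening the gate before swapping in ReLU, although the actual hardening target and schedule are defined in the paper and are not reproduced here.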

Problem

Research questions and friction points this paper is trying to address.

GELU

ReLU

activation function

model conversion

gating hardness

Innovation

Methods, ideas, or system contributions that make the work stand out.

λ-GELU

gating hardness

ReLU-ization