PowLU: An Activation Function for Stable Pre-Training of LLMs

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the numerical instability that arises in large-scale, low-precision training of large language models because of the near-quadratic amplification of the SwiGLU activation function. To mitigate this issue, the authors propose PowLU, an activation function based on rational power functions that provides adaptive nonlinearity while preserving model expressivity and improving training stability. The authors give theoretical justification for several key properties of PowLU, and experiments on the Ling architecture at 7.9B and 124B total parameters show that PowLU is competitive with SwiGLU and its clipped variant, SwiGLU-Clip, while improving the scalability of large-scale pre-training.

📝 Abstract

In contemporary large language models (LLMs), the swish-gated linear unit (SwiGLU) activation function is widely adopted to regulate information flow and introduce non-linearity. For large positive inputs, SwiGLU approximates the quadratic function $x^2$, providing strong nonlinearity and expressive capacity. However, this property also causes numerical instability as the input or model scale increases, particularly in low-precision LLM training. The main reason is its approximately quadratic amplification, which enlarges the output range and exacerbates outliers. To address this issue, we propose a stable activation function, Power Linear Unit (PowLU), for large-scale LLM pre-training. Specifically, PowLU employs a rational power function to achieve adaptive nonlinearity, thereby improving representation ability and enabling stable training in spike regions. Moreover, we provide theoretical justification for several key properties of PowLU. Scaling law experiments confirm that the performance is consistent across model sizes, and further experiments with the Ling architecture (7.9B and 124B total parameters) demonstrate that PowLU achieves competitive results against SwiGLU and SwiGLU-Clip in large-scale LLM training. The experiments also show that PowLU effectively improves the scalability of large-scale LLM training.
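The quadratic behaviour described in the abstract can be illustrated with a small numerical sketch. It assumes a scalar SwiGLU core, swish(gate) * value, with the learned projections omitted and the gate and value tied to the same input x, which is the setting in which the $x^2$ approximation applies. The function names and the comparison against the largest finite float16 value are illustrative choices, not taken from the paper, and PowLU itself is not implemented because its formula is not given here.

```python
import math

FP16_MAX = 65504.0  # largest finite value in IEEE 754 half precision


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def swish(x: float) -> float:
    """Swish (SiLU): x * sigmoid(x)."""
    return x * sigmoid(x)


def swiglu_scalar(gate: float, value: float) -> float:
    """Scalar SwiGLU core: swish(gate) * value.

    Assumption: the linear projections used in a real layer are omitted.
    """
    return swish(gate) * value


def quadratic_ratio(x: float) -> float:
    """Ratio of swiglu_scalar(x, x) to x**2; tends to 1 as x grows."""
    return swiglu_scalar(x, x) / (x * x)


if __name__ == "__main__":
    for x in (1.0, 4.0, 16.0, 64.0, 256.0):
        y = swiglu_scalar(x, x)
        print(
            f"x={x:6.1f}  swiglu={y:10.2f}  "
            f"ratio_to_x^2={quadratic_ratio(x):.4f}  "
            f"exceeds_fp16_max={y > FP16_MAX}"
        )
```

Under these assumptions the ratio to $x^2$ is simply sigmoid(x), so it is already close to 1 at moderate inputs, and a tied input of 256 produces an output above the float16 maximum. This is one concrete way to see why outliers become a concern in low-precision training.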

Problem

Research questions and friction points this paper is trying to address.

numerical instability

activation function

large language models

low-precision training

SwiGLU

Innovation

Methods, ideas, or system contributions that make the work stand out.

activation function

numerical stability

large language models