The Devil is in the Condition Numbers: Why is GLU Better than non-GLU Structure?

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates why Gated Linear Units (GLUs) empirically outperform non-gated architectures. Analyzing two-layer networks in the Neural Tangent Kernel (NTK) regime, the work shows that gating reshapes the kernel spectrum, yielding a smaller condition number and a more compact eigenvalue distribution, which in turn accelerates training convergence. The resulting training dynamics account for the characteristic loss-crossing phenomenon observed between GLU and non-GLU models. Experiments on models including Vision Transformers (ViTs) and GPT-2 show that GLUs have limited impact on the generalization gap, suggesting that their primary advantage lies in improved optimization dynamics rather than enhanced generalization. The work thus offers a theoretical perspective on the optimization benefits of gating mechanisms.
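To make the link between condition number and convergence speed concrete, here is a minimal sketch assuming linearized (kernel) gradient descent on a squared loss, where each residual component along a kernel eigenvector shrinks by a factor of (1 - lr * eigenvalue) per step. The function name `kernel_gd_loss` and both spectra are hypothetical illustrations, not values or code from the paper.

```python
import numpy as np

def kernel_gd_loss(eigenvalues, residual0, steps):
    """Squared loss along linearized (kernel) gradient descent.

    Works in the eigenbasis of a positive semidefinite kernel: the residual
    component along eigenvector i is multiplied by (1 - lr * lambda_i)
    at every step.
    """
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    residual0 = np.asarray(residual0, dtype=float)
    lr = 1.0 / eigenvalues.max()           # keeps every factor in [0, 1)
    t = np.arange(steps + 1)[:, None]      # shape (steps + 1, 1)
    residual = residual0 * (1.0 - lr * eigenvalues) ** t
    return 0.5 * np.sum(residual ** 2, axis=1)

# Two hypothetical spectra sharing the same largest eigenvalue.
wide = np.array([1.0, 0.1, 0.001])    # condition number 1000
compact = np.array([1.0, 0.5, 0.1])   # condition number 10
r0 = np.ones(3)                       # equal initial residual in every mode

loss_wide = kernel_gd_loss(wide, r0, steps=200)
loss_compact = kernel_gd_loss(compact, r0, steps=200)
print("final loss, wide spectrum:   ", loss_wide[-1])
print("final loss, compact spectrum:", loss_compact[-1])
```

With the step size fixed by the largest eigenvalue, the slowest mode decays at a rate of roughly 1 - 1/κ per step, where κ is the condition number, so the more compact spectrum ends at a much lower loss after the same number of steps.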

📝 Abstract

Gated Linear Units (GLU) and their variants are widely adopted in modern open-source large language model architectures and consistently outperform their non-gated counterparts, yet the underlying reasons for this advantage remain unclear. In this work, we study GLU by analyzing two-layer networks in the neural tangent kernel (NTK) regime. Our analysis reveals that the GLU structure reshapes the NTK spectrum, leading to a smaller condition number and a more compact eigenvalue distribution. Building on this finding, we further analyze the resulting training dynamics and show how the reshaped spectrum leads to faster convergence of GLU models, including a characteristic loss-crossing phenomenon observed between GLU and non-GLU models. Finally, we empirically observe that GLU has limited impact in reducing the generalization gap on various models, including ViT and GPT-2, suggesting that its primary benefit lies in accelerating optimization rather than reducing the generalization gap.
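The quantity the abstract centers on, the condition number of the NTK of a two-layer network, can be computed numerically. The sketch below builds the empirical NTK Gram matrix from analytic Jacobians for a plain two-layer network and a gated one. The parameterization (SiLU activation, 1/sqrt(m) scaling, Gaussian initialization, unit-norm random inputs) and all function names are assumptions made for illustration; the paper's exact setup may differ, and no particular outcome is implied by this snippet.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def silu_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s + z * s * (1.0 - s)

def _outer_rows(coeff, X):
    # coeff: (n, m), X: (n, d) -> per-sample gradient w.r.t. an (m, d) matrix
    return (coeff[:, :, None] * X[:, None, :]).reshape(len(X), -1)

def ntk_mlp(X, W, a):
    """Empirical NTK of f(x) = a . silu(W x) / sqrt(m)."""
    m = W.shape[0]
    Z = X @ W.T
    J = np.hstack([silu(Z), _outer_rows(a * silu_grad(Z), X)]) / np.sqrt(m)
    return J @ J.T

def ntk_glu(X, W, V, a):
    """Empirical NTK of f(x) = a . (silu(W x) * (V x)) / sqrt(m)."""
    m = W.shape[0]
    Z, G = X @ W.T, X @ V.T
    J = np.hstack([
        silu(Z) * G,                           # gradient w.r.t. a
        _outer_rows(a * silu_grad(Z) * G, X),  # gradient w.r.t. W
        _outer_rows(a * silu(Z), X),           # gradient w.r.t. V
    ]) / np.sqrt(m)
    return J @ J.T

def condition_number(K):
    eig = np.linalg.eigvalsh(K)   # eigenvalues in ascending order
    return eig[-1] / eig[0]

rng = np.random.default_rng(0)
n, d, m = 10, 8, 256              # samples, input dim, hidden width (arbitrary)
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
W, V = rng.standard_normal((m, d)), rng.standard_normal((m, d))
a = rng.standard_normal(m)

K_mlp, K_glu = ntk_mlp(X, W, a), ntk_glu(X, W, V, a)
kappa_mlp, kappa_glu = condition_number(K_mlp), condition_number(K_glu)
print("condition number, non-GLU:", kappa_mlp)
print("condition number, GLU:    ", kappa_glu)
```

Analytic Jacobians keep the sketch dependency-free; an automatic-differentiation library would be the more natural choice for deeper models. Comparing the two printed values across seeds, widths, and data distributions is one way to probe the spectral claim on a small scale.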

Problem

Research questions and friction points this paper is trying to address.

Gated Linear Units

condition number

neural tangent kernel

optimization

generalization gap

Innovation

Methods, ideas, or system contributions that make the work stand out.

Gated Linear Units

Neural Tangent Kernel

Condition Number