The Devil is in the Condition Numbers: Why is GLU Better than non-GLU Structure?

📅 2026-05-20

🤖 AI Summary
This study investigates why Gated Linear Units (GLUs) empirically outperform non-gated architectures. Analyzing two-layer networks in the Neural Tangent Kernel (NTK) regime, it shows that the GLU structure reshapes the kernel spectrum, yielding a smaller condition number and a more compact eigenvalue distribution, which in turn speeds up training convergence. The resulting training dynamics also account for the loss-crossing phenomenon observed between GLU and non-GLU models. Experiments on various models, including ViT and GPT-2, indicate that GLU has limited impact on the generalization gap, suggesting that its primary benefit is faster optimization rather than better generalization. The work thus offers a theoretical perspective on the optimization benefits of gating.
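To make the link between the kernel spectrum and convergence speed concrete, the following is a sketch of the standard linearized (fixed-kernel) dynamics, not the paper's own derivation. It assumes a squared loss, a constant NTK Gram matrix $K$ on the $n$ training inputs with eigenpairs $(\lambda_i, u_i)$, and a residual $r_t = f_t(X) - y$; all of these symbols are introduced here for illustration.

```latex
\begin{aligned}
\dot r_t &= -K\, r_t
  \quad\Longrightarrow\quad
  r_t = \sum_{i=1}^{n} e^{-\lambda_i t}\,\bigl(u_i^\top r_0\bigr)\, u_i, \\
L(t) &= \tfrac12 \lVert r_t \rVert^2
  = \tfrac12 \sum_{i=1}^{n} e^{-2\lambda_i t}\,\bigl(u_i^\top r_0\bigr)^2, \\
r_{k+1} &= (I - \eta K)\, r_k, \qquad
  \eta = \frac{1}{\lambda_{\max}}
  \;\Longrightarrow\;
  \text{slowest mode contracts by } 1 - \frac{1}{\kappa}
  \text{ per step}, \quad
  \kappa = \frac{\lambda_{\max}}{\lambda_{\min}} .
\end{aligned}
```

Under these assumptions, a smaller condition number $\kappa$ means the slowest eigen-direction decays faster relative to the largest stable step size. Because the loss is a sum of exponentials weighted by the initial residual, two models whose spectra are shaped differently can in principle produce loss curves that cross.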
📝 Abstract
Gated Linear Units (GLU) and their variants are widely adopted in modern open-source large language model architectures and consistently outperform their non-gated counterparts, yet the underlying reasons for this advantage remain unclear. In this work, we study GLU by analyzing two-layer networks in the neural tangent kernel (NTK) regime. Our analysis reveals that the GLU structure reshapes the NTK spectrum, leading to a smaller condition number and a more compact eigenvalue distribution. Building on this finding, we further analyze the resulting training dynamics and show how the reshaped spectrum leads to faster convergence of GLU models, including a characteristic loss-crossing phenomenon observed between GLU and non-GLU models. Finally, we empirically observe that GLU has limited impact in reducing the generalization gap on various models, including ViT and GPT-2, suggesting that its primary benefit lies in accelerating optimization rather than reducing the generalization gap.
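As an illustration of the quantity the abstract refers to, here is a minimal NumPy sketch that builds the empirical NTK Gram matrix of a two-layer network with and without gating and reports each condition number. Everything specific in it is an assumption made for the example rather than the paper's setup: the ReLU activation, the ReGLU-style gate `relu(Wx) * (Vx)`, the `1/sqrt(m)` scaling, the random unit-norm data, and the sizes `n`, `d`, `m`. Parameter counts are not matched between the two models.

```python
import numpy as np

def ntk_mlp(X, W, a):
    """Empirical NTK of f(x) = a^T relu(W x) / sqrt(m) (assumed non-gated model)."""
    m = W.shape[0]
    pre = X @ W.T                       # (n, m) pre-activations
    act = np.maximum(pre, 0.0)          # relu(W x)
    dact = (pre > 0).astype(X.dtype)    # relu'(W x)
    K_a = act @ act.T / m               # contribution of gradients w.r.t. a
    G = dact * a                        # per-unit factor of gradients w.r.t. W
    K_W = (X @ X.T) * (G @ G.T) / m
    return K_a + K_W

def ntk_glu(X, W, V, a):
    """Empirical NTK of f(x) = a^T (relu(W x) * (V x)) / sqrt(m) (assumed gated model)."""
    m = W.shape[0]
    pre = X @ W.T
    act = np.maximum(pre, 0.0)
    dact = (pre > 0).astype(X.dtype)
    lin = X @ V.T                       # linear branch V x
    h = act * lin                       # gated hidden features
    K_a = h @ h.T / m                   # gradients w.r.t. a
    G_W = dact * lin * a                # gradients w.r.t. W
    G_V = act * a                       # gradients w.r.t. V
    XXt = X @ X.T
    return K_a + XXt * (G_W @ G_W.T) / m + XXt * (G_V @ G_V.T) / m

def condition_number(K):
    """Ratio of largest to smallest eigenvalue of a symmetric PSD matrix."""
    eig = np.linalg.eigvalsh(K)         # ascending order
    return eig[-1] / eig[0]

rng = np.random.default_rng(0)
n, d, m = 20, 10, 512                   # illustrative sizes, not from the paper
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs
W = rng.standard_normal((m, d))
V = rng.standard_normal((m, d))
a = rng.standard_normal(m)

K_mlp = ntk_mlp(X, W, a)
K_glu = ntk_glu(X, W, V, a)
print("non-GLU condition number:", condition_number(K_mlp))
print("GLU condition number:    ", condition_number(K_glu))
```

The numbers printed depend on the random seed, the data, and the width, so a single run should be read as a way to inspect the two spectra, not as evidence for or against the paper's claim. Comparing `np.linalg.eigvalsh(K_mlp)` and `np.linalg.eigvalsh(K_glu)` directly shows how spread out each eigenvalue distribution is.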
Problem

Research questions and friction points this paper is trying to address.

Gated Linear Units
condition number
neural tangent kernel
optimization
generalization gap
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gated Linear Units
Neural Tangent Kernel
Condition Number
Optimization Dynamics
Spectral Analysis