On the Mechanism and Dynamics of Modular Addition: Fourier Features, Lottery Ticket, and Grokking

📅 2026-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates how overparameterized two-layer neural networks solve modular addition tasks by learning Fourier features. By analyzing gradient flow, leveraging comparison lemmas for ordinary differential equations, and invoking spectral initialization theory, the authors construct an interlayer phase-coupling dynamical model that reveals a three-stage grokking process in feature learning. The core contribution lies in identifying "phase symmetry" and "frequency diversity" as critical conditions for feature diversification, elucidating how noisy signals from individual neurons achieve globally robust inference through a majority-vote mechanism. Furthermore, the study demonstrates that frequency competition is jointly governed by initial spectral amplitudes and phase alignment, and unifies the lottery ticket hypothesis with grokking within a single theoretical framework, highlighting the synergistic role of weight decay and loss minimization.

📝 Abstract
We present a comprehensive analysis of how two-layer neural networks learn features to solve the modular addition task. Our work provides a full mechanistic interpretation of the learned model and a theoretical explanation of its training dynamics. While prior work has identified that individual neurons learn single-frequency Fourier features and phase alignment, it does not fully explain how these features combine into a global solution. We bridge this gap by formalizing a diversification condition that emerges during training when overparametrized, consisting of two parts: phase symmetry and frequency diversification. We prove that these properties allow the network to collectively approximate a noisy indicator function on the correct logit for the modular addition task. While individual neurons produce noisy signals, the phase symmetry enables a majority-voting scheme that cancels out noise, allowing the network to robustly identify the correct sum. Furthermore, we explain the emergence of these features under random initialization via a lottery ticket mechanism. Our gradient flow analysis proves that frequencies compete within each neuron, with the "winner" determined by its initial spectral magnitude and phase alignment. From a technical standpoint, we provide a rigorous characterization of the layer-wise phase coupling dynamics and formalize the competitive landscape using the ODE comparison lemma. Finally, we use these insights to demystify grokking, characterizing it as a three-stage process involving memorization followed by two generalization phases, driven by the competition between loss minimization and weight decay.
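As a toy illustration (not the paper's code), the frequency-diversity and majority-voting mechanism can be sketched with the standard cosine construction for modular addition: each frequency contributes a cosine logit peaked at c = (a + b) mod p, and summing over several distinct frequencies cancels the off-peak noise so that the argmax robustly identifies the correct sum. All names here (`freqs`, `logits`) are illustrative.

```python
import numpy as np

p = 23  # modulus (prime, so nonzero frequencies never alias to zero)
rng = np.random.default_rng(0)
# a handful of distinct frequencies, standing in for "frequency diversity"
freqs = rng.choice(np.arange(1, (p - 1) // 2 + 1), size=5, replace=False)

def logits(a, b):
    c = np.arange(p)
    # each frequency "votes" with a cosine that equals 1 exactly when
    # c = (a + b) mod p; summing the votes attenuates the off-peak values
    return sum(np.cos(2 * np.pi * f * (a + b - c) / p) for f in freqs)

a, b = 7, 19
pred = int(np.argmax(logits(a, b)))
assert pred == (a + b) % p
```

At the correct class the argument of every cosine is a multiple of 2π, so all frequencies vote 1; elsewhere the votes are strictly smaller and partially cancel, mirroring the majority-voting argument in the abstract.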
Problem

Research questions and friction points this paper is trying to address.

modular addition
Fourier features
neural network dynamics
grokking
feature combination
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fourier features
lottery ticket hypothesis
phase symmetry
grokking dynamics
overparameterized neural networks
Jianliang He
Department of Statistics and Data Science, Yale University
Leda Wang
Department of Statistics and Data Science, Yale University
Siyu Chen
Ph.D., Department of Statistics and Data Science, Yale University
Statistics · DL · LLM · economics · RL
Zhuoran Yang
Yale University
machine learning · optimization · reinforcement learning · statistics