The Global Empirical NTK: Self-Referential Bias and Dimensionality of Gradient Descent Learning

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the intractability of forming the global empirical Neural Tangent Kernel (NTK) in finite-width neural networks, which has hindered a precise understanding of gradient descent dynamics. Treating the model state as the solution to a single global implicit constraint, the authors factor the NTK into a product of two operators: $K$, which captures immediate parameter-to-state interactions, and $P$, which captures internal state-to-state dependencies. For a broad class of weight-based models, including RNNs and Transformers, they prove a universal Kronecker-core theorem showing that $K$ has an exact, computable form given by the Gram matrix of weight-site variables. This structure bottlenecks the effective rank of the NTK and induces a self-referential bias: gradient descent preferentially learns within the dominant modes of joint hidden and input activity. The authors also show that model dynamics at initialization bias the NTK and can prevent task components from being learned effectively, and that the NTK of a self-attention Transformer is likewise structurally low-rank. To make the NTK usable as a practical metric, they provide kpflow, a library built on randomized matrix-free numerical linear algebra.
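To make the Gram-matrix claim concrete, here is a minimal toy sketch, not the paper's theorem or code. It assumes the simplest possible weight site, a single linear map h_i = W x_i, and checks numerically that the parameter-to-state kernel J Jᵀ equals the Kronecker product of the input Gram matrix with an identity. The function name and the shapes `n`, `d`, `m` are illustrative assumptions.

```python
import numpy as np

def linear_layer_param_state_kernel(X, m):
    """Return J J^T for states h_i = W x_i, where J = d(states) / d(vec W).

    X has shape (n, d): n inputs of dimension d. W has shape (m, d).
    States are flattened with index (i, a); weights with index (b, k).
    """
    n, d = X.shape
    # d h_i[a] / d W[b, k] = delta_ab * x_i[k]; independent of W for a linear map.
    J = np.einsum("ik,ab->iabk", X, np.eye(m)).reshape(n * m, m * d)
    return J @ J.T

rng = np.random.default_rng(0)
n, d, m = 6, 3, 4
X = rng.standard_normal((n, d))

K_explicit = linear_layer_param_state_kernel(X, m)

# Gram matrix of the weight-site inputs, lifted by a Kronecker product.
G = X @ X.T
K_kron = np.kron(G, np.eye(m))

print("matches Kronecker form:", np.allclose(K_explicit, K_kron))
print("rank:", np.linalg.matrix_rank(K_explicit), "bound:", min(n, d) * m)
```

In this toy case the rank of the kernel is capped by the rank of the Gram matrix times the state dimension, which is the kind of structural bottleneck the summary describes. The general statement for RNNs and Transformers is the paper's result and is not reproduced here.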

📝 Abstract

In training a neural network with gradient descent (GD), each iteration induces a linear operator that governs first-order updates to a model's internal state variables. We define this operator as the Global Empirical Neural Tangent Kernel (NTK). In finite-width networks, the NTK is typically intractable to form, leading prior work to focus on restrictive settings such as tracking outputs only or taking infinite-width limits. Here, we study the structure of the NTK for a range of models. Formulating the model state as the solution to a single global implicit constraint, we derive the NTK as a product of two operators: K, accounting for immediate parameter-to-state interactions, and P, describing internal state-to-state dependencies. For a broad class of weight-based models, including RNNs and transformers, we prove a universal Kronecker-core theorem showing that K admits an exact, computable form given by the Gram matrix of weight-site variables. This core structure reveals that the NTK is structurally bottlenecked, constraining its effective rank and giving rise to a self-referential bias whereby GD preferentially learns within dominant modes of joint hidden and input activity. For recurrent models, we examine the spectrum of the NTK and show when it is biased and low-rank in space or time under the proposed decomposition. We further demonstrate that model dynamics at initialization bias the NTK, restricting learning and preventing task components from being learned effectively. Finally, we show that the NTK associated with a self-attention transformer is likewise structurally constrained to be low-rank. Overall, we show that the NTK possesses tractable structure that explains GD bias toward task solutions and the emergence of low-rank representations. To enable use of the NTK as a practical metric, we build kpflow, a library relying on randomized matrix-free numerical linear algebra.
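The abstract notes that the NTK is typically intractable to form and that kpflow relies on randomized matrix-free numerical linear algebra. The sketch below illustrates that general idea only; it is not kpflow's API, which is not described here. It uses a standard randomized subspace iteration to estimate the leading eigenvalues of a symmetric positive semidefinite operator that is available only through a hypothetical `matvec` callable. The names `randomized_top_eigs`, `oversample`, and `n_iter` are assumptions made for illustration.

```python
import numpy as np

def randomized_top_eigs(matvec, dim, k, oversample=10, n_iter=2, seed=0):
    """Estimate the top-k eigenvalues of a symmetric PSD operator.

    matvec(V) must return A @ V for a (dim, r) block V; A is never formed.
    """
    rng = np.random.default_rng(seed)
    # Probe the operator with a random block to sample its range.
    Q, _ = np.linalg.qr(matvec(rng.standard_normal((dim, k + oversample))))
    # A few subspace iterations sharpen the estimate of the dominant subspace.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(matvec(Q))
    # Project the operator onto the small subspace and solve there.
    B = Q.T @ matvec(Q)
    B = 0.5 * (B + B.T)  # symmetrize against round-off
    evals = np.linalg.eigvalsh(B)  # ascending order
    return evals[::-1][:k]

# Toy low-rank operator A = J J^T, applied without building the 200 x 200 matrix.
rng = np.random.default_rng(1)
J = rng.standard_normal((200, 5))
apply_A = lambda V: J @ (J.T @ V)

approx = randomized_top_eigs(apply_A, dim=200, k=5)
exact = np.linalg.eigvalsh(J.T @ J)[::-1]  # same nonzero spectrum as J J^T
print("max abs error:", np.max(np.abs(approx - exact)))
```

Each step touches the operator only through block products, so the cost scales with the number of probes rather than with the size of the full kernel. For an NTK, `matvec` would be built from Jacobian-vector and vector-Jacobian products; that wiring is omitted here.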

Problem

Research questions and friction points this paper is trying to address.

Neural Tangent Kernel

gradient descent

self-referential bias

low-rank

finite-width networks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Neural Tangent Kernel

self-referential bias

low-rank structure