Towards Quantifying the Hessian Structure of Neural Networks

📅 2025-05-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: The origins of the approximate block-diagonal structure observed in neural network Hessian matrices remain theoretically unexplained. Method: We propose a two-part explanatory framework, "static forces" (architectural design) and "dynamic forces" (training dynamics), and give a rigorous analysis of the static force at random initialization. Using random matrix theory, we study linear models and single-hidden-layer networks under both MSE and cross-entropy losses, and compare the limiting spectral distributions of the diagonal and off-diagonal Hessian blocks. Contribution/Results: We prove that as the number of classes $C \to \infty$, the spectra of the diagonal and off-diagonal Hessian blocks separate, so the block-diagonal structure emerges; $C$ is a primary driver of this structure. The result holds for both loss functions and both model families studied, and it may help explain the strongly block-diagonal Hessians empirically observed in large language models ($C \sim 10^4$ to $10^5$).
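
As a hedged illustration of why $C$ matters (a back-of-the-envelope calculation for the simplest case, not the paper's random-matrix argument), consider a linear classifier with logits $z = Wx$, where $W \in \mathbb{R}^{C \times d}$ has rows $w_1, \dots, w_C$ and $p = \mathrm{softmax}(z)$. The symbols $W$, $w_i$, $x$, $d$, and $p$ are notation assumed here for illustration. The per-sample cross-entropy Hessian block for the pair $(w_i, w_j)$ is

$$
\frac{\partial^2 \ell_{\mathrm{CE}}}{\partial w_i \, \partial w_j^\top} = \left(\delta_{ij}\, p_i - p_i p_j\right) x x^\top .
$$

If one further assumes that the predicted probabilities are roughly uniform at initialization, $p_i \approx 1/C$, then diagonal blocks scale like $\frac{1}{C}\left(1 - \frac{1}{C}\right) x x^\top$ while off-diagonal blocks scale like $-\frac{1}{C^2}\, x x^\top$. Their ratio is about $1/(C-1)$, which shrinks as $C$ grows.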

📝 Abstract

Empirical studies have reported that the Hessian matrix of neural networks (NNs) exhibits a near-block-diagonal structure, yet its theoretical foundation remains unclear. In this work, we reveal two forces that shape the Hessian structure: a "static force" rooted in the architecture design, and a "dynamic force" arising from training. We then provide a rigorous theoretical analysis of the "static force" at random initialization. We study linear models and 1-hidden-layer networks with the mean-squared-error (MSE) loss and the cross-entropy (CE) loss for classification tasks. By leveraging random matrix theory, we compare the limit distributions of the diagonal and off-diagonal Hessian blocks and find that the block-diagonal structure arises as $C \rightarrow \infty$, where $C$ denotes the number of classes. Our findings reveal that $C$ is a primary driver of the near-block-diagonal structure. These results may shed new light on the Hessian structure of large language models (LLMs), which typically operate with a large $C$ exceeding $10^4$ or $10^5$.
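
The trend described in the abstract can be checked numerically in the simplest setting. The following is a minimal sketch, not the paper's experiment: it builds the exact cross-entropy Hessian of a linear softmax classifier at random initialization and compares the average Frobenius norm of off-diagonal blocks with that of diagonal blocks. The function names `ce_hessian_blocks` and `offdiag_to_diag_ratio`, the Gaussian data and initialization, and the sizes `d=8`, `n=256` are all assumptions chosen for illustration.

```python
import numpy as np


def ce_hessian_blocks(W, X):
    """Averaged softmax cross-entropy Hessian for logits z = W x.

    Returns an array H of shape (C, C, d, d), where H[i, j] is the block of
    second derivatives with respect to rows w_i and w_j of W.
    """
    Z = X @ W.T                               # (n, C) logits
    Z -= Z.max(axis=1, keepdims=True)         # stabilise the softmax
    P = np.exp(Z)
    P /= P.sum(axis=1, keepdims=True)         # (n, C) class probabilities
    n, C = P.shape
    # Per-sample coefficient matrix diag(p) - p p^T, shape (n, C, C).
    A = np.einsum("ni,ij->nij", P, np.eye(C)) - np.einsum("ni,nj->nij", P, P)
    # Block (i, j) is the sample mean of A[n, i, j] * x_n x_n^T.
    return np.einsum("nij,nk,nl->ijkl", A, X, X, optimize=True) / n


def offdiag_to_diag_ratio(C, d=8, n=256, seed=0):
    """Mean off-diagonal block norm divided by mean diagonal block norm."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / np.sqrt(d), size=(C, d))  # assumed random init
    X = rng.normal(size=(n, d))                          # assumed Gaussian inputs
    H = ce_hessian_blocks(W, X)
    norms = np.linalg.norm(H, axis=(2, 3))    # Frobenius norm per block, (C, C)
    diag_mean = np.trace(norms) / C
    off_mean = (norms.sum() - np.trace(norms)) / (C * (C - 1))
    return off_mean / diag_mean


if __name__ == "__main__":
    for C in (4, 16, 64):
        print(C, offdiag_to_diag_ratio(C))
```

In this toy setting the printed ratio should shrink as `C` increases, which is consistent with the abstract's claim that a larger number of classes pushes the Hessian toward block-diagonal form. It says nothing about the 1-hidden-layer case or the MSE loss, which the paper treats separately.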

Problem

Research questions and friction points this paper is trying to address.

Theoretical foundation of Hessian matrix structure in NNs

Impact of static and dynamic forces on Hessian

Role of class count C in block-diagonal structure

Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzed Hessian matrix block-diagonal structure in NNs

Identified static and dynamic forces shaping Hessian

Used random matrix theory for theoretical analysis
