Towards Quantifying the Hessian Structure of Neural Networks

📅 2025-05-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
The origins of the approximate block-diagonal structure observed in neural network Hessian matrices remain theoretically unexplained. Method: We propose a dual explanatory framework of "static forces" (architectural design) and "dynamic forces" (training dynamics) and, for the first time, rigorously disentangle their effects. Using random matrix theory, we analyze linear models and single-hidden-layer networks under both MSE and cross-entropy losses, comparing the spectral behavior of diagonal versus off-diagonal Hessian blocks at random initialization. Contribution/Results: We prove that as the number of classes $C \to \infty$, the limiting spectra of the diagonal and off-diagonal Hessian blocks separate, so block diagonality emerges intrinsically; $C$ is the dominant structural parameter. This result holds across loss functions and network scales, providing the first falsifiable theoretical explanation for the strongly block-diagonal Hessians empirically observed in large language models ($C \sim 10^4$–$10^5$).
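To make the role of $C$ concrete, the following is a standard one-sample computation for a linear softmax model with CE loss. It is our own illustration, not reproduced from the paper, but it matches the scaling intuition behind the spectral-separation claim:

```latex
% One-sample CE Hessian of a linear softmax model (standard computation,
% included as an illustration; the notation is ours, not the paper's).
% Logits z = Wx, probabilities p = softmax(z), loss \ell = -\log p_y.
\[
  \nabla^{2}_{\operatorname{vec}(W)}\,\ell
  \;=\; \bigl(\operatorname{diag}(p) - p p^{\top}\bigr) \otimes x x^{\top},
  \qquad
  \text{block } (c, c') \;=\; p_c \bigl(\mathbf{1}[c = c'] - p_{c'}\bigr)\, x x^{\top}.
\]
% At random initialization p_c is roughly 1/C, so a diagonal block scales
% like C^{-1} x x^T while each off-diagonal block scales like C^{-2} x x^T:
% the two families of blocks separate in scale as C grows.
```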

📝 Abstract
Empirical studies reported that the Hessian matrix of neural networks (NNs) exhibits a near-block-diagonal structure, yet its theoretical foundation remains unclear. In this work, we reveal two forces that shape the Hessian structure: a ``static force'' rooted in the architecture design, and a ``dynamic force'' arising from training. We then provide a rigorous theoretical analysis of the ``static force'' at random initialization. We study linear models and 1-hidden-layer networks with the mean-squared error (MSE) loss and the Cross-Entropy (CE) loss for classification tasks. By leveraging random matrix theory, we compare the limit distributions of the diagonal and off-diagonal Hessian blocks and find that the block-diagonal structure arises as $C \rightarrow \infty$, where $C$ denotes the number of classes. Our findings reveal that $C$ is a primary driver of the near-block-diagonal structure. These results may shed new light on the Hessian structure of large language models (LLMs), which typically operate with a large $C$ exceeding $10^4$ or $10^5$.
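As a quick numerical check of this claim, here is a minimal sketch of our own (not the authors' code; the model, dimensions, and sample counts are arbitrary choices) that compares the spectral norms of diagonal and off-diagonal Hessian blocks of a randomly initialized linear softmax classifier as $C$ grows:

```python
# Minimal sketch (not the paper's code): for a linear softmax classifier at
# random initialization, compare the average spectral norm of diagonal vs.
# off-diagonal Hessian blocks as the class count C grows. Uses the
# closed-form per-sample CE Hessian  H = (diag(p) - p p^T) (x) x x^T.
import numpy as np

rng = np.random.default_rng(0)

def block_norms(C, d=20, n_samples=100):
    """Average spectral norms of diagonal and off-diagonal Hessian blocks."""
    W = rng.normal(scale=1.0 / np.sqrt(d), size=(C, d))
    diag_norm, offdiag_norm = 0.0, 0.0
    for _ in range(n_samples):
        x = rng.normal(size=d)
        z = W @ x
        p = np.exp(z - z.max()); p /= p.sum()   # softmax probabilities
        A = np.diag(p) - np.outer(p, p)         # logit-space Hessian
        xxT_norm = x @ x                        # ||x x^T||_2 = ||x||^2
        # Block (c, c') of the weight Hessian is A[c, c'] * x x^T,
        # so its spectral norm is |A[c, c']| * ||x||^2.
        diag_norm += np.abs(np.diag(A)).mean() * xxT_norm
        i, j = np.triu_indices(C, k=1)
        offdiag_norm += np.abs(A[i, j]).mean() * xxT_norm
    return diag_norm / n_samples, offdiag_norm / n_samples

for C in (2, 10, 100, 1000):
    dn, on = block_norms(C)
    print(f"C={C:5d}  diag≈{dn:.4f}  offdiag≈{on:.6f}  ratio≈{on/dn:.4f}")
```

Since each off-diagonal block scales like $C^{-2}$ against $C^{-1}$ for the diagonal blocks, the printed ratio should shrink roughly like $1/C$, consistent with the block-diagonal structure becoming more pronounced at large $C$.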
Problem

Research questions and friction points this paper is trying to address.

Theoretical foundation of Hessian matrix structure in NNs
Impact of static and dynamic forces on Hessian
Role of class count C in block-diagonal structure
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzed Hessian matrix block-diagonal structure in NNs
Identified static and dynamic forces shaping Hessian
Used random matrix theory for theoretical analysis
👥 Authors
Zhaorui Dong
The Chinese University of Hong Kong, Shenzhen, China
Yushun Zhang
The Chinese University of Hong Kong, Shenzhen, China
Optimization · Deep Learning
Zhi-Quan Luo
Professor, The Chinese University of Hong Kong, Shenzhen, China
Optimization · Signal Processing · Communication
Jianfeng Yao
The Chinese University of Hong Kong, Shenzhen, China
Ruoyu Sun
The Chinese University of Hong Kong, Shenzhen, China