Accelerating Neural Network Training Along Sharp and Flat Directions

📅 2025-05-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the trade-off between curvature sensitivity and convergence speed in neural network training. We propose an optimization framework grounded in the geometric structure of the Hessian matrix. Our analysis identifies the dominant subspace of the Hessian with sharp (high-curvature) directions and the bulk subspace with flat (low-curvature) directions. Leveraging this insight, we study Bulk-SGD, which restricts gradient updates to the flat (bulk) subspace, and further propose interpolated gradient methods that unify sharp (dominant-subspace) and flat (bulk-subspace) update trajectories. We also connect this subspace decomposition empirically to the generalized Gauss–Newton and functional Hessian terms. Empirically, updating along flat directions can accelerate convergence but may compromise stability; the dominant subspace concentrates over 90% of curvature energy; and the interpolation strategy consistently improves both training stability and final accuracy across CIFAR-10, CIFAR-100, and ImageNet, yielding +0.8–1.3% top-1 accuracy gains.
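
The claim that curvature energy concentrates in the Dominant subspace can be illustrated on a small example. Below is a minimal NumPy sketch on a toy least-squares problem, whose Hessian is exactly the Gauss-Newton matrix X^T X; treating "curvature energy" as the fraction of the Hessian trace captured by the top-k eigenvalues, and the choice k = 10, are assumptions for illustration, not the paper's definitions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: loss(w) = 0.5 * ||X w - y||^2.
# Its Hessian is the (constant) Gauss-Newton matrix H = X^T X.
n_samples, n_params = 200, 50
X = rng.normal(size=(n_samples, n_params))
X *= np.logspace(0, -2, n_params)          # make a few directions much sharper than the rest

H = X.T @ X
eigvals, eigvecs = np.linalg.eigh(H)       # ascending eigenvalues
eigvals = eigvals[::-1]                    # sort descending

k = 10  # assumed size of the "Dominant" subspace (top-k eigenvectors)
energy_ratio = eigvals[:k].sum() / eigvals.sum()
print(f"fraction of curvature energy in the top-{k} subspace: {energy_ratio:.2%}")
```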

📝 Abstract
Recent work has highlighted a surprising alignment between gradients and the top eigenspace of the Hessian -- termed the Dominant subspace -- during neural network training. Concurrently, there has been growing interest in the distinct roles of sharp and flat directions in the Hessian spectrum. In this work, we study Bulk-SGD, a variant of SGD that restricts updates to the orthogonal complement of the Dominant subspace. Through ablation studies, we characterize the stability properties of Bulk-SGD and identify critical hyperparameters that govern its behavior. We show that updates along the Bulk subspace, corresponding to flatter directions in the loss landscape, can accelerate convergence but may compromise stability. To balance these effects, we introduce interpolated gradient methods that unify SGD, Dom-SGD, and Bulk-SGD. Finally, we empirically connect this subspace decomposition to the Generalized Gauss-Newton and Functional Hessian terms, showing that curvature energy is largely concentrated in the Dominant subspace. Our findings suggest a principled approach to designing curvature-aware optimizers.
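
To make the Bulk-SGD update concrete, here is a minimal NumPy sketch on a toy quadratic: the gradient is projected onto the orthogonal complement of the top-k Hessian eigenspace (the Bulk subspace) and the step is taken only along that flat component. The dense eigendecomposition, the subspace size k, and the learning rate are illustrative assumptions; a practical implementation on a network would estimate the Dominant subspace with Hessian-vector products (e.g. Lanczos or power iteration).

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy quadratic loss: loss(w) = 0.5 * w^T H w - b^T w, with gradient H w - b.
n_params, k, lr, steps = 50, 10, 0.1, 300
A = rng.normal(size=(n_params, n_params))
H = (A @ A.T) / n_params                   # symmetric PSD stand-in for the Hessian
b = rng.normal(size=n_params)

eigvals, eigvecs = np.linalg.eigh(H)       # ascending eigenvalues
V_dom = eigvecs[:, -k:]                    # top-k eigenvectors: the Dominant subspace

def loss(w):
    return 0.5 * w @ H @ w - b @ w

w = rng.normal(size=n_params)
print(f"initial loss: {loss(w):.4f}")
for _ in range(steps):
    grad = H @ w - b
    grad_dom = V_dom @ (V_dom.T @ grad)    # component in the Dominant subspace
    grad_bulk = grad - grad_dom            # Bulk component (flat directions)
    w -= lr * grad_bulk                    # Bulk-SGD: step only along flat directions
print(f"loss after {steps} Bulk-SGD steps: {loss(w):.4f}")
```
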
Problem

Research questions and friction points this paper is trying to address.

Study gradient alignment with Hessian top eigenspace during training
Characterize the impact of sharp vs. flat directions in the Hessian spectrum
Balance convergence and stability via subspace-constrained gradient methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bulk-SGD restricts updates to the Bulk subspace, the orthogonal complement of the Dominant subspace
Interpolated gradient methods unify SGD, Dom-SGD, and Bulk-SGD (see the sketch after this list)
Connects the subspace decomposition to the Generalized Gauss-Newton and Functional Hessian terms, showing curvature energy concentrates in the Dominant subspace
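
As a sketch of how the interpolation might look, the update below weights the Dominant and Bulk gradient components with separate coefficients so that SGD, Dom-SGD, and Bulk-SGD appear as special cases. The two-coefficient parameterization and the function name interpolated_step are assumptions for illustration; the paper's exact scheme may differ.

```python
import numpy as np

def interpolated_step(w, grad, V_dom, lr, alpha, beta):
    """One interpolated update that weights the Dominant and Bulk gradient
    components separately: (alpha, beta) = (1, 1) recovers SGD,
    (1, 0) recovers Dom-SGD, and (0, 1) recovers Bulk-SGD.
    The two-coefficient form is an illustrative assumption, not necessarily
    the paper's exact parameterization."""
    grad_dom = V_dom @ (V_dom.T @ grad)   # projection onto the top-k eigenspace
    grad_bulk = grad - grad_dom           # orthogonal complement (Bulk subspace)
    return w - lr * (alpha * grad_dom + beta * grad_bulk)

# Example usage with the toy quadratic from the Bulk-SGD sketch above:
# w = interpolated_step(w, H @ w - b, V_dom, lr=0.1, alpha=0.2, beta=1.0)
```

In this form, tuning (alpha, beta) between the Dom-SGD and Bulk-SGD extremes trades off progress along sharp versus flat directions.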