🤖 AI Summary
This work addresses the scalability challenge of the generalized Gauss–Newton (GGN) curvature for multiclass softmax cross-entropy, which couples all logits through the softmax covariance and makes curvature-vector products harder to scale as the number of classes grows. The authors derive an exact scalar "true class versus the rest" margin representation of the loss and use it to decompose the GGN matrix exactly into a true-class contrast term and a positive semidefinite within-competitor covariance term. The proposed Fast Gauss–Newton (FGN) method retains only the former, yielding a positive semidefinite under-approximation of the GGN that is exact in the binary case. FGN solves the damped update as an equivalent whitened row-space system, matrix-free, by conjugate gradient using Jacobian-vector and vector-Jacobian products of the scalar margin map, avoiding explicit matrix construction. Experiments, including an evaluation on a fixed-feature multiclass head, show that FGN stays closest to the full GGN when competitor mass is concentrated or damping is large, and deviates as the dropped within-competitor covariance grows.
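The split into a true-class contrast term and a within-competitor covariance term can be checked numerically at the logit level. The sketch below is an illustration, not the authors' code. It assumes the margin is m = z_y - logsumexp over the competitor logits, so that the cross-entropy equals softplus(-m); the function names `softmax` and `ggn_decomposition` are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ggn_decomposition(z, y):
    """Split the softmax logit curvature diag(p) - p p^T for true class y.

    Returns (full, contrast, within) with full == contrast + within.
    Assumes the margin m = z_y - logsumexp_{k != y} z_k (illustrative choice).
    """
    p = softmax(z)
    py = p[y]
    # Competitor distribution q: softmax renormalized over classes other than y.
    q = p.copy()
    q[y] = 0.0
    q /= (1.0 - py)
    # Gradient of the margin with respect to the logits: +1 on y, -q elsewhere.
    a = -q
    a[y] = 1.0
    full = np.diag(p) - np.outer(p, p)
    # True-vs-rest term (kept by FGN): rank one, weight p_y (1 - p_y).
    contrast = py * (1.0 - py) * np.outer(a, a)
    # Within-competitor covariance (dropped by FGN): PSD, weight (1 - p_y).
    within = (1.0 - py) * (np.diag(q) - np.outer(q, q))
    return full, contrast, within
```

With two classes the competitor distribution is a point mass, so the dropped term vanishes and the rank-one term alone reproduces the full curvature, consistent with the binary-exactness claim.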
📝 Abstract
In multiclass softmax cross-entropy, the full generalized Gauss-Newton (GGN) curvature couples all output logits through the softmax covariance, making curvature-vector products harder to scale as the number of classes grows. We show that the standard multiclass GGN can be decomposed exactly into a true-vs-rest term and a positive semidefinite within-competitor covariance term. Fast Gauss-Newton (FGN) retains the first term and drops the second, yielding a positive semidefinite under-approximation of the multiclass GGN that is exact for binary classification. The derivation uses an exact true-vs-rest scalar-margin representation of softmax cross-entropy: the loss and gradient are unchanged, and the approximation enters only at the curvature level. Exploiting the FGN curvature structure, the damped update can be written as an equivalent whitened row-space system with one row per mini-batch example. We solve this system matrix-free by conjugate gradient using Jacobian-vector and vector-Jacobian products of the scalar margin map. Targeted mechanism experiments and an evaluation on a fixed-feature multiclass head support the predictions from the decomposition: FGN stays closest to the full softmax GGN when competitor mass is concentrated or damping is large, and deviates as the dropped within-competitor covariance grows.
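One way to picture the whitened row-space solve is the following minimal sketch for a linear head on fixed features. It is an assumption-laden illustration rather than the paper's implementation: it assumes a mean-reduced loss, takes the FGN curvature as J_w^T J_w with J_w the margin Jacobian scaled row-wise by sqrt(p_y (1 - p_y) / B), and uses the standard Woodbury identity to move the damped solve to a B-by-B system. The names `margin_quantities` and `fgn_step` are hypothetical, and the Jacobian products are written by hand for the linear case instead of using automatic differentiation.

```python
import numpy as np

def margin_quantities(W, X, y):
    """Margin gradient w.r.t. logits (rows a_i) and true-class probabilities."""
    Z = X @ W.T                                  # logits, shape (B, C)
    Z = Z - Z.max(axis=1, keepdims=True)
    P = np.exp(Z)
    P /= P.sum(axis=1, keepdims=True)
    idx = np.arange(len(y))
    py = P[idx, y]
    A = -P / (1.0 - py)[:, None]                 # -q_i on competitor classes
    A[idx, y] = 1.0                              # +1 on the true class
    return A, py

def fgn_step(W, X, y, damping, cg_iters=200, tol=1e-12):
    """Solve (J_w^T J_w + damping * I) step = -grad via a B x B row-space system."""
    B = X.shape[0]
    A, py = margin_quantities(W, X, y)
    s = np.sqrt(py * (1.0 - py) / B)             # per-row whitening scale

    def jvp(V):                                  # J_w V: one scalar per example
        return s * np.sum(A * (X @ V.T), axis=1)

    def vjp(u):                                  # J_w^T u: same shape as W
        return (A * (s * u)[:, None]).T @ X

    # Gradient of the mean cross-entropy, expressed through the margin map.
    grad = (A * (-(1.0 - py) / B)[:, None]).T @ X

    def matvec(u):                               # (J_w J_w^T + damping * I) u
        return jvp(vjp(u)) + damping * u

    # Plain conjugate gradient on the row-space system, never forming a matrix.
    b = jvp(grad)
    u = np.zeros(B)
    r = b.copy()
    d = r.copy()
    rs = r @ r
    for _ in range(cg_iters):
        if np.sqrt(rs) < tol:
            break
        Kd = matvec(d)
        alpha = rs / (d @ Kd)
        u += alpha * d
        r -= alpha * Kd
        rs_new = r @ r
        d = r + (rs_new / rs) * d
        rs = rs_new

    # Woodbury: map the row-space solution back to parameter space.
    return -(grad - vjp(u)) / damping
```

The linear system here has one unknown per mini-batch example regardless of the number of classes or parameters, which is the point of working in row space. For a deep network the two hand-written products would presumably be replaced by automatic-differentiation JVP and VJP calls on the scalar margin map.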