🤖 AI Summary
Efficient computation of Hessian-vector products (HVPs) remains a critical bottleneck for spectral analysis of billion-parameter-scale foundation models. Method: We propose the first distributed stochastic Lanczos quadrature algorithm scalable to hundred-billion-parameter models, implemented in HessFormer—an open-source, extensible framework integrating intra-node multi-GPU parallelism, distributed Hessian-vector multiplication, stochastic Lanczos quadrature, and deep integration with Transformer libraries. Contribution/Results: Our approach enables, for the first time, empirical spectral density estimation of the DeepSeek-70B model (70B parameters) on a single node—surpassing prior work limited to single-GPU execution and models under 1B parameters. This establishes a scalable spectral analysis infrastructure for second-order optimization, generalization analysis, and trustworthy AI research in large language models.
📝 Abstract
Whilst there have been major advancements in the field of first order optimisation of deep learning models, where state of the art open source mixture of expert models go into the hundreds of billions of parameters, methods that rely on Hessian vector products, are still limited to run on a single GPU and thus cannot even work for models in the billion parameter range. We release a software package extbf{HessFormer}, which integrates nicely with the well known Transformers package and allows for distributed hessian vector computation across a single node with multiple GPUs. Underpinning our implementation is a distributed stochastic lanczos quadrature algorithm, which we release for public consumption. Using this package we investigate the Hessian spectral density of the recent Deepseek $70$bn parameter model.