HessFormer: Hessians at Foundation Scale

📅 2025-05-16

📈 Citations: 0

✨ Influential: 0

career value

195K/year

🤖 AI Summary

Efficient computation of Hessian-vector products (HVPs) remains a critical bottleneck for spectral analysis of billion-parameter-scale foundation models. Method: We propose the first distributed stochastic Lanczos quadrature algorithm scalable to hundred-billion-parameter models, implemented in HessFormer—an open-source, extensible framework integrating intra-node multi-GPU parallelism, distributed Hessian-vector multiplication, stochastic Lanczos quadrature, and deep integration with Transformer libraries. Contribution/Results: Our approach enables, for the first time, empirical spectral density estimation of the DeepSeek-70B model (70B parameters) on a single node—surpassing prior work limited to single-GPU execution and models under 1B parameters. This establishes a scalable spectral analysis infrastructure for second-order optimization, generalization analysis, and trustworthy AI research in large language models.

Technology Category

Application Category

📝 Abstract

Whilst there have been major advancements in the field of first order optimisation of deep learning models, where state of the art open source mixture of expert models go into the hundreds of billions of parameters, methods that rely on Hessian vector products, are still limited to run on a single GPU and thus cannot even work for models in the billion parameter range. We release a software package extbf{HessFormer}, which integrates nicely with the well known Transformers package and allows for distributed hessian vector computation across a single node with multiple GPUs. Underpinning our implementation is a distributed stochastic lanczos quadrature algorithm, which we release for public consumption. Using this package we investigate the Hessian spectral density of the recent Deepseek $70$bn parameter model.

Problem

Research questions and friction points this paper is trying to address.

Enables distributed Hessian computation for large-scale models

Overcomes single-GPU limitations in Hessian vector products

Analyzes Hessian spectral density in billion-parameter models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributed Hessian vector computation across multiple GPUs

Integration with Transformers package for compatibility

Distributed stochastic Lanczos quadrature algorithm

🔎 Similar Papers

No similar papers found.