Fishers for Free? Approximating the Fisher Information Matrix by Recycling the Squared Gradient Accumulator

📅 2025-07-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
How can the diagonal of the Fisher Information Matrix (FIM) be approximated at essentially zero additional computational cost? This paper introduces the Squisher and provides the first systematic demonstration that the squared-gradient moving averages already maintained by adaptive optimizers such as Adam serve as high-fidelity, free approximations of the FIM diagonal, requiring no additional forward/backward passes or stochastic sampling. By reusing these training-time statistics, the Squisher quantifies parameter sensitivity without increasing computational overhead. Experiments across five canonical applications of the Fisher diagonal, including network pruning, uncertainty estimation, and continual learning, show that the Squisher performs on par with the standard Fisher-diagonal estimate while substantially outperforming existing baselines. It is also fully plug-and-play, integrating into standard training pipelines without architectural or optimizer modifications.
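To make the recycling step concrete, here is a minimal PyTorch sketch, not taken from the paper, of how Adam's second-moment buffer (`exp_avg_sq`) could be read out as a free Fisher-diagonal proxy; the function name `squisher_diagonal` is illustrative.

```python
import torch

def squisher_diagonal(model, optimizer):
    """Recycle Adam's second-moment accumulator as a zero-cost
    approximation of the Fisher diagonal (the "Squisher").

    Assumes `optimizer` is torch.optim.Adam or AdamW and that training
    has already populated its per-parameter state."""
    fisher = {}
    for name, param in model.named_parameters():
        state = optimizer.state.get(param, {})
        # "exp_avg_sq" is Adam's exponential moving average of squared
        # gradients: the statistic reused here in place of freshly
        # computed squared sampled gradients.
        if "exp_avg_sq" in state:
            fisher[name] = state["exp_avg_sq"].detach().clone()
        else:
            fisher[name] = torch.zeros_like(param)
    return fisher
```

Because the accumulator is a byproduct of training, reading it out adds no forward or backward passes.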

📝 Abstract
The diagonal of a model's Fisher Information Matrix (the "Fisher diagonal") has frequently been used as a way to measure parameter sensitivity. Typically, the Fisher diagonal is estimated via squared sampled gradients of the model's likelihood with respect to its parameters, averaged over a few hundred or thousand examples -- a process which incurs nontrivial computational costs. At the same time, adaptive gradient methods like the ubiquitous Adam optimizer compute a moving average of the squared gradient over the course of training. This paper therefore explores whether an approximation of the Fisher diagonal can be obtained "for free" by recycling the squared gradient accumulator that has already been computed over the course of training. Through a comprehensive set of experiments covering five applications of the Fisher diagonal, we demonstrate that the "Squisher" (SQUared gradient accumulator as an approximation of the FISHER) consistently performs similarly to the Fisher diagonal while outperforming baseline methods. Additionally, we clarify the exact differences between the Squisher and the Fisher diagonal and provide empirical quantification of their respective impact.
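For contrast with the abstract's description, a minimal sketch, assuming a classification model, of the standard Fisher-diagonal estimate that the Squisher sidesteps: squared gradients of the log-likelihood, with labels sampled from the model's own predictive distribution, averaged over a few hundred batches.

```python
import torch
import torch.nn.functional as F

def fisher_diagonal(model, data_loader, n_batches=100):
    """Standard Fisher-diagonal estimate: average squared gradients of
    the model's log-likelihood, with labels sampled from the model's
    predictive distribution. This is the extra-pass procedure that the
    Squisher avoids."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    count = 0
    for x, _ in data_loader:
        if count >= n_batches:
            break
        logits = model(x)
        # Sample labels from the model itself (the "true" Fisher);
        # plugging in dataset labels gives the empirical Fisher instead.
        y = torch.distributions.Categorical(logits=logits).sample()
        model.zero_grad()
        F.cross_entropy(logits, y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                # For clarity this squares the mini-batch gradient; the
                # textbook estimator squares per-example gradients
                # (e.g., run with batch size 1).
                fisher[n] += p.grad.detach() ** 2
        count += 1
    return {n: f / max(count, 1) for n, f in fisher.items()}
```

Each estimation batch costs a full forward and backward pass, which is the overhead the paper proposes to eliminate.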
Problem

Research questions and friction points this paper is trying to address.

Can the Fisher diagonal be estimated without incurring computational cost beyond training itself?
Can the squared-gradient accumulator that adaptive optimizers already maintain be reused as a Fisher approximation?
How does the Squisher compare with the standard Fisher diagonal and with baseline methods across downstream applications?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Recycles the squared-gradient accumulator already computed by Adam-style optimizers as a "free" approximation of the Fisher diagonal (the Squisher)
Requires no extra forward/backward passes, stochastic sampling, or changes to the training pipeline
Matches the Fisher diagonal's performance across five applications while outperforming baselines, and quantifies the exact differences between the two (see the sketch below)
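As one illustration of the plug-and-play claim, a hedged sketch of how the recycled accumulator might slot into an EWC-style continual-learning penalty, one of the applications the experiments cover; `squisher_diagonal` is the hypothetical helper sketched earlier.

```python
import torch

def ewc_penalty(model, fisher, anchor_params, lam=1.0):
    """EWC-style regularizer: penalize drift away from `anchor_params`
    (a parameter snapshot taken after the previous task) in proportion
    to each parameter's approximate Fisher diagonal. `fisher` can be
    the dict returned by the hypothetical `squisher_diagonal` above."""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, param in model.named_parameters():
        penalty = penalty + (fisher[name] * (param - anchor_params[name]) ** 2).sum()
    return lam * penalty

# Illustrative usage while training on a new task:
#   loss = task_loss + ewc_penalty(model, squisher, old_params, lam=0.1)
```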