AdaFisher: Adaptive Second Order Optimization via Fisher Information

📅 2024-05-26
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
📄 PDF
🤖 AI Summary
Second-order optimizers for deep neural networks (DNNs) suffer from high computational overhead and sensitivity to hyperparameters, limiting their practical adoption. To address this, we propose AdaFisher—a novel adaptive second-order optimization method based on a block-diagonal approximation of the Fisher information matrix (FIM). Its core innovation lies in the first integration of an adaptive block-diagonal Fisher approximation with a dynamic damping mechanism, enabling efficient gradient preconditioning and robust curvature estimation. AdaFisher significantly reduces computational complexity while improving convergence speed, generalization performance, and hyperparameter robustness. Extensive experiments on image classification and language modeling benchmarks demonstrate that AdaFisher consistently outperforms state-of-the-art first-order optimizers—including Adam—in both accuracy and convergence rate. Moreover, it exhibits markedly reduced sensitivity to critical hyperparameters such as the learning rate, enhancing training stability and ease of use.
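The preconditioning idea described above can be sketched in a few lines. The snippet below is a minimal, illustrative approximation, not the paper's exact update rule: it takes the diagonal of the two Kronecker factors of a layer's Fisher block (built from layer inputs and backpropagated output gradients), adds a fixed damping term (the paper's damping is dynamic), and divides the gradient elementwise. The function name and all parameters are hypothetical.

```python
import numpy as np

def adafisher_precondition(grad, act, pre_grad, damping=1e-3):
    """Illustrative diagonal Kronecker-factored Fisher preconditioning.

    grad     : (out_dim, in_dim) weight gradient of a linear layer
    act      : (batch, in_dim)   layer inputs (activations)
    pre_grad : (batch, out_dim)  backpropagated output gradients
    damping  : constant added for numerical stability
                (a stand-in for the paper's dynamic damping)
    """
    # Diagonals of the Kronecker factors A = E[a a^T] and G = E[g g^T]
    a_diag = (act ** 2).mean(axis=0)        # (in_dim,)
    g_diag = (pre_grad ** 2).mean(axis=0)   # (out_dim,)
    # diag(G ⊗ A) laid out to match the weight gradient, plus damping
    fisher_diag = np.outer(g_diag, a_diag) + damping
    # Elementwise preconditioned gradient
    return grad / fisher_diag
```

Because only the factor diagonals are kept, the cost per layer is linear in the number of weights, which is the kind of saving that makes a Fisher-based preconditioner competitive with first-order methods.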

📝 Abstract
First-order optimization methods are currently the mainstream in training deep neural networks (DNNs). Optimizers like Adam incorporate limited curvature information by preconditioning the stochastic gradient with a diagonal matrix during training. Despite their widespread use, second-order optimization algorithms exhibit superior convergence properties compared to first-order counterparts such as Adam and SGD. However, their practicality in training DNNs is still limited by increased per-iteration computation and suboptimal accuracy relative to first-order methods. We present AdaFisher, an adaptive second-order optimizer that leverages a block-diagonal approximation of the Fisher information matrix for adaptive gradient preconditioning. AdaFisher aims to bridge the gap between enhanced convergence capabilities and computational efficiency in a second-order optimization framework for training DNNs. Despite the traditionally slow pace of second-order optimizers, we showcase that AdaFisher can be reliably adopted for image classification and language modelling, and stands out for its stability and robustness in hyperparameter tuning. We demonstrate that AdaFisher outperforms SOTA optimizers in terms of both accuracy and convergence speed. Code is available from https://github.com/AtlasAnalyticsLab/AdaFisher.
Problem

Research questions and friction points this paper is trying to address.

Bridging gap between convergence and computational efficiency
Enhancing DNN training with adaptive second-order optimization
Improving accuracy and convergence speed over SOTA optimizers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive second-order optimizer using Fisher information
Block-diagonal Kronecker-factored Fisher approximation for gradient preconditioning
Enhances convergence, generalization, and computational efficiency