Outlier Gradient Analysis: Efficiently Identifying Detrimental Training Samples for Deep Learning Models

📅 2024-05-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address performance degradation in deep learning caused by noisy or mislabeled training data, this paper proposes a gradient-based anomaly detection method that avoids computationally expensive Hessian matrix inversion. It establishes a theoretical connection between influence functions and statistical outliers in per-sample parameter gradients, so that detrimental training samples can be identified directly as outliers in gradient space. The approach is first validated on synthetic datasets, then evaluated on image label-corruption detection, data selection for NLP transformer models, and sample filtering for large language model (LLM) fine-tuning, where it outperforms conventional influence-function approaches. Crucially, it reduces computational overhead by approximately two orders of magnitude while preserving scalability, interpretability, and practical deployability, enabling efficient, large-scale data curation without sacrificing theoretical grounding.

📝 Abstract
A core data-centric learning challenge is the identification of training samples that are detrimental to model performance. Influence functions serve as a prominent tool for this task and offer a robust framework for assessing training data influence on model predictions. Despite their widespread use, the high computational cost associated with calculating the inverse of the Hessian matrix poses constraints, particularly when analyzing large deep models. In this paper, we establish a bridge between identifying detrimental training samples via influence functions and outlier gradient detection. This transformation not only presents a straightforward and Hessian-free formulation but also provides insights into the role of the gradient in sample impact. Through systematic empirical evaluations, we first validate the hypothesis of our proposed outlier gradient analysis approach on synthetic datasets. We then demonstrate its effectiveness in detecting mislabeled samples in vision models and selecting data samples for improving the performance of natural language processing transformer models. We also extend its use to influential sample identification for fine-tuning Large Language Models.
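For context on the cost the paper avoids: the classical influence-function score (in the standard formulation of Koh and Liang, 2017, not reproduced from this paper) of a training sample $z$ on a test point $z_{\text{test}}$ is

$$
\mathcal{I}(z, z_{\text{test}}) \;=\; -\,\nabla_\theta L(z_{\text{test}}, \hat\theta)^{\top}\, H_{\hat\theta}^{-1}\, \nabla_\theta L(z, \hat\theta),
\qquad
H_{\hat\theta} \;=\; \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^{2} L(z_i, \hat\theta),
$$

where $\hat\theta$ are the trained parameters and $L$ is the loss. The $H_{\hat\theta}^{-1}$ term is the expensive part for large models, which motivates replacing influence scores with direct outlier detection on the per-sample gradients $\nabla_\theta L(z, \hat\theta)$.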
Problem

Research questions and friction points this paper is trying to address.

Identifying detrimental training samples for deep learning models
Reducing computational cost of influence functions in large models
Detecting mislabeled samples and selecting data for model improvement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Outlier gradient detection replaces influence functions
Hessian-free formulation reduces computational cost
Effective for mislabeled data and LLM fine-tuning
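The core idea sketched above can be illustrated with a minimal toy example: compute (or simulate) per-sample gradients and flag statistical outliers in gradient space, with no Hessian involved. This is an illustrative sketch, not the paper's exact procedure; the synthetic gradients and the simple z-scored-norm detector stand in for real backprop gradients and whatever outlier detector one prefers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for per-sample gradients of the loss w.r.t. model parameters
# (n_samples x n_params). In practice these come from one backward pass
# per training example; here they are synthetic for illustration.
clean_grads = rng.normal(0.0, 1.0, size=(200, 16))   # well-labeled samples
noisy_grads = rng.normal(5.0, 1.0, size=(10, 16))    # simulated mislabeled samples
grads = np.vstack([clean_grads, noisy_grads])

# Hessian-free step: instead of inverse-Hessian-vector products, score each
# sample by how far its gradient lies from the bulk. A z-score on gradient
# norms is the simplest possible detector; any off-the-shelf outlier
# detection method could be substituted.
norms = np.linalg.norm(grads, axis=1)
z = (norms - norms.mean()) / norms.std()
flagged = np.where(z > 2.0)[0]   # candidate detrimental samples

print(sorted(flagged.tolist()))
```

On this toy data the flagged indices are exactly the ten simulated mislabeled samples (indices 200–209), since their gradients are far from the clean cluster.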