Rescaled Influence Functions: Accurate Data Attribution in High Dimension

📅 2025-06-07

📈 Citations: 0

✨ Influential: 0

career value

192K/year

🤖 AI Summary

In high-dimensional regimes (where parameter count ≥ sample size), conventional influence functions (IFs) systematically underestimate the effect of individual training samples on model predictions, leading to inaccurate data attribution. To address this, we propose the rescaled influence function (RIF), the first method that substantially improves estimation accuracy for sample-level influences in high-dimensional models while preserving computational efficiency. RIF reformulates the IF framework via a second-order Taylor expansion, integrating an efficient approximation of the Hessian inverse and a parameter rescaling technique, enabling plug-and-play deployment. Our theoretical analysis identifies the root cause of IF’s systematic underestimation—namely, unaccounted curvature effects in overparameterized settings. Extensive experiments across multiple real-world datasets demonstrate that RIF reduces attribution error by over 40% on average compared to baseline methods. Moreover, RIF successfully detects novel data poisoning attacks that remain invisible to standard IF-based diagnostics.

Technology Category

Application Category

📝 Abstract

How does the training data affect a model's behavior? This is the question we seek to answer with data attribution. The leading practical approaches to data attribution are based on influence functions (IF). IFs utilize a first-order Taylor approximation to efficiently predict the effect of removing a set of samples from the training set without retraining the model, and are used in a wide variety of machine learning applications. However, especially in the high-dimensional regime (# params $geq Omega($# samples$)$), they are often imprecise and tend to underestimate the effect of sample removals, even for simple models such as logistic regression. We present rescaled influence functions (RIF), a new tool for data attribution which can be used as a drop-in replacement for influence functions, with little computational overhead but significant improvement in accuracy. We compare IF and RIF on a range of real-world datasets, showing that RIFs offer significantly better predictions in practice, and present a theoretical analysis explaining this improvement. Finally, we present a simple class of data poisoning attacks that would fool IF-based detections but would be detected by RIF.

Problem

Research questions and friction points this paper is trying to address.

Improving accuracy of data attribution in high dimensions

Addressing underestimation in influence function predictions

Detecting data poisoning attacks more effectively

Innovation

Methods, ideas, or system contributions that make the work stand out.

Rescaled Influence Functions improve accuracy

Drop-in replacement for influence functions

Better predictions in high-dimensional regimes

🔎 Similar Papers

TimeInf: Time Series Data Contribution via Influence Functions