🤖 AI Summary
This work addresses theoretical limitations of influence functions (IF) and the single Newton step (NS) method in data attribution: existing analyses rely on unrealistic global strong convexity assumptions, and their error bounds deteriorate sharply with the parameter dimension $d$ and the number $k$ of removed samples, failing to characterize precise scaling laws. We establish the first asymptotically tight error bounds for IF and NS in non-strongly convex, high-dimensional sparse settings. Our theory explains why NS generally outperforms IF: we rigorously derive the deviation of NS from the exact retrained parameters as $\widetilde{\Theta}(kd/n^2)$, while the gap between the NS and IF estimates scales as $\widetilde{\Theta}((k+d)\sqrt{kd}/n^2)$. Leveraging local curvature characterization, expectation analysis over random subsets, and high-dimensional asymptotic tools, we further prove the average-case optimality of NS.
📝 Abstract
Data attribution aims to explain model predictions by estimating how they would change if certain training points were removed, and is used in a wide range of applications, from interpretability and credit assignment to unlearning and privacy.
Even in the relatively simple case of linear regression, existing mathematical analyses of leading data attribution methods such as Influence Functions (IF) and the single Newton Step (NS) remain limited in two key ways. First, they rely on global strong convexity assumptions, which are often not satisfied in practice. Second, the resulting bounds scale very poorly with the number of parameters ($d$) and the number of samples removed ($k$). As a result, these analyses are not tight enough to answer fundamental questions such as "what is the asymptotic scaling of the errors of each method?" or "which of these methods is more accurate for a given dataset?"
In this paper, we introduce a new analysis of the NS and IF data attribution methods for convex learning problems. To the best of our knowledge, this is the first analysis of these questions that does not assume global strong convexity, and also the first explanation of [KATL19] and [RH25a]'s observation that NS data attribution is often more accurate than IF. We prove that for sufficiently well-behaved logistic regression, our bounds are asymptotically tight up to poly-logarithmic factors, yielding scaling laws for the errors under average-case sample removals:
$$
\mathbb{E}_{T \subseteq [n],\, |T| = k} \bigl[ \|\hat\theta_T - \hat\theta_T^{\mathrm{NS}}\|_2 \bigr] = \widetilde{\Theta}\!\left(\frac{k d}{n^2}\right), \qquad \mathbb{E}_{T \subseteq [n],\, |T| = k} \bigl[ \|\hat\theta_T^{\mathrm{NS}} - \hat\theta_T^{\mathrm{IF}}\|_2 \bigr] = \widetilde{\Theta}\!\left( \frac{(k + d)\sqrt{k d}}{n^2} \right).
$$
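To make the objects being compared concrete, here is a minimal numerical sketch of the three quantities in the bounds above: exact retraining $\hat\theta_T$, the single Newton step estimate $\hat\theta_T^{\mathrm{NS}}$ (one Newton step on the leave-$T$-out loss from the full-data optimum), and the influence function estimate $\hat\theta_T^{\mathrm{IF}}$ (same correction but using the full-data Hessian). The setup is hypothetical: a small L2-regularized logistic regression on synthetic data, with constants and regularization chosen for stability rather than to match the paper's exact setting.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, lam = 500, 5, 10, 1e-3  # samples, dims, removals, L2 strength

X = rng.normal(size=(n, d))
y = (rng.random(n) < 1 / (1 + np.exp(-X @ rng.normal(size=d)))).astype(float)

def per_sample_grads(theta, X, y):
    """Row i is the gradient of the i-th log-loss term at theta."""
    p = 1 / (1 + np.exp(-X @ theta))
    return X * (p - y)[:, None]

def hessian(theta, X):
    """Hessian of the summed log-loss plus (lam/2)||theta||^2."""
    p = 1 / (1 + np.exp(-X @ theta))
    return (X.T * (p * (1 - p))) @ X + lam * np.eye(d)

def fit(X, y, iters=50):
    """Minimize sum of log-losses + (lam/2)||theta||^2 by Newton's method."""
    theta = np.zeros(d)
    for _ in range(iters):
        g = per_sample_grads(theta, X, y).sum(axis=0) + lam * theta
        theta -= np.linalg.solve(hessian(theta, X), g)
    return theta

theta_hat = fit(X, y)                       # optimum on all n points
keep = np.ones(n, dtype=bool)
keep[:k] = False                            # remove T = {0, ..., k-1}
theta_exact = fit(X[keep], y[keep])         # exact retraining

# Removing T flips the sign of its gradient contribution, so both estimators
# add H^{-1} * (sum of removed gradients); they differ only in which Hessian.
g_T = per_sample_grads(theta_hat, X[:k], y[:k]).sum(axis=0)
theta_if = theta_hat + np.linalg.solve(hessian(theta_hat, X), g_T)        # IF
theta_ns = theta_hat + np.linalg.solve(hessian(theta_hat, X[keep]), g_T)  # NS

print("NS error:", np.linalg.norm(theta_exact - theta_ns))
print("IF error:", np.linalg.norm(theta_exact - theta_if))
```

Both estimators avoid retraining; the only structural difference is that NS inverts the leave-$T$-out Hessian while IF reuses the full-data Hessian, which is exactly the gap the second bound above quantifies.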