🤖 AI Summary
Data attribution methods exhibit poor robustness under distributional shift, undermining their practical reliability. This paper introduces the first certified robust attribution framework grounded in a natural Wasserstein metric, applicable uniformly to both convex models and deep neural networks. Key contributions include: (1) defining a natural Wasserstein metric that eliminates spectral amplification effects in representation space; (2) deriving the first nontrivial Lipschitz certification bound for neural network attributions; and (3) establishing that Self-Influence, which the analysis shows equals the attribution's Lipschitz constant, provides a theoretically grounded foundation for anomaly detection. Experiments on CIFAR-10 with ResNet-18 show that the method certifies 68.7% of ranking pairs (versus 0% for Euclidean baselines), while Self-Influence attains an AUROC of 0.970 for label-noise detection and identifies 94.1% of mislabeled samples within the top 20% of ranked instances.
📝 Abstract
Data attribution methods identify which training examples are responsible for a model's predictions, but their sensitivity to distributional perturbations undermines practical reliability. We present a unified framework for certified robust attribution that extends from convex models to deep networks. For convex settings, we derive Wasserstein-Robust Influence Functions (W-RIF) with provable coverage guarantees. For deep networks, we demonstrate that Euclidean certification is rendered vacuous by spectral amplification -- a mechanism where the inherent ill-conditioning of deep representations inflates Lipschitz bounds by over $10{,}000\times$. This explains why standard TRAK scores, while accurate point estimates, are geometrically fragile: naive Euclidean robustness analysis yields 0% certification. Our key contribution is the Natural Wasserstein metric, which measures perturbations in the geometry induced by the model's own feature covariance. This eliminates spectral amplification, reducing worst-case sensitivity by $76\times$ and stabilizing attribution estimates. On CIFAR-10 with ResNet-18, Natural W-TRAK certifies 68.7% of ranking pairs compared to 0% for Euclidean baselines -- to our knowledge, the first non-vacuous certified bounds for neural network attribution. Furthermore, we prove that the Self-Influence term arising from our analysis equals the Lipschitz constant governing attribution stability, providing theoretical grounding for leverage-based anomaly detection. Empirically, Self-Influence achieves 0.970 AUROC for label noise detection, identifying 94.1% of corrupted labels by examining just the top 20% of training data.
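To make the leverage-based detection idea concrete, here is a minimal NumPy sketch of a Self-Influence-style score. It is an illustration under stated assumptions, not the paper's implementation: it assumes the score takes the familiar leverage form $\phi_i^\top (\Sigma + \lambda I)^{-1} \phi_i$, where $\Sigma$ is the empirical feature covariance, and the names `self_influence`, `flag_top_fraction`, and the `ridge` parameter are hypothetical. The features are synthetic; the abstract does not specify how the real ones are extracted.

```python
import numpy as np


def self_influence(features, ridge=1e-3):
    """Leverage-style score phi_i^T (Sigma + ridge * I)^{-1} phi_i per row.

    Assumption: Sigma is the uncentered empirical feature covariance and
    `ridge` is a small regularizer added for numerical stability.
    """
    n, d = features.shape
    cov = features.T @ features / n
    regularized = cov + ridge * np.eye(d)
    # Solve (Sigma + ridge * I) X = Phi^T instead of forming an explicit inverse.
    solved = np.linalg.solve(regularized, features.T)  # shape (d, n)
    # Row-wise dot product gives one score per training example.
    return np.sum(features * solved.T, axis=1)


def flag_top_fraction(scores, fraction=0.2):
    """Indices of the highest-scoring examples, highest first."""
    k = max(1, int(np.ceil(fraction * len(scores))))
    return np.argsort(scores)[::-1][:k]


# Synthetic stand-in for learned features, with one artificially extreme row.
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 8))
features[7] *= 10.0  # hypothetical anomalous example

scores = self_influence(features)
flagged = flag_top_fraction(scores, fraction=0.2)
```

In this toy setup the scaled row receives the largest score and therefore lands in the flagged set, which mirrors the abstract's procedure of inspecting only the top 20% of training data ranked by Self-Influence.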