Counterfactual Influence as a Distributional Quantity

📅 2025-06-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing privacy risk assessments of machine learning models, particularly those relying solely on self-influence, underestimate memorization, especially in the presence of (near-)duplicate training samples. Method: We propose the *full influence distribution*, a counterfactual, distributional memorization metric that quantifies the joint influence of the entire training set on each individual sample. Results: By exactly computing full influence matrices for a small language model and for image classification, we show that near-duplicates substantially suppress self-influence scores while the affected samples remain (near-)extractable, so conventional self-influence-based assessments severely underestimate memorization and the associated privacy risks. The framework also reveals latent near-duplicate structure in CIFAR-10, improving both the accuracy and robustness of memorization risk assessment.

📝 Abstract
Machine learning models are known to memorize samples from their training data, raising concerns around privacy and generalization. Counterfactual self-influence is a popular metric to study memorization, quantifying how the model's prediction for a sample changes depending on the sample's inclusion in the training dataset. However, recent work has shown memorization to be affected by factors beyond self-influence, with other training samples, in particular (near-)duplicates, having a large impact. We here study memorization treating counterfactual influence as a distributional quantity, taking into account how all training samples influence how a sample is memorized. For a small language model, we compute the full influence distribution of training samples on each other and analyze its properties. We find that solely looking at self-influence can severely underestimate tangible risks associated with memorization: the presence of (near-)duplicates seriously reduces self-influence, while we find these samples to be (near-)extractable. We observe similar patterns for image classification, where simply looking at the influence distributions reveals the presence of near-duplicates in CIFAR-10. Our findings highlight that memorization stems from complex interactions across training data and is better captured by the full influence distribution than by self-influence alone.
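The quantities in the abstract can be made concrete with a small sketch (not the paper's code; the toy dataset, model, and all names here are illustrative assumptions): exact leave-one-out retraining of a tiny logistic-regression model gives an influence matrix whose entry `(i, j)` is the change in loss on sample `i` when sample `j` is removed from training. The diagonal is counterfactual self-influence; a full row is the influence distribution over the training set for that sample.

```python
# Hedged sketch: exact leave-one-out influence matrix on a toy dataset.
# infl[i, j] = loss on sample i under a model trained WITHOUT sample j,
#              minus loss on sample i under a model trained on all data.
# Diagonal entries infl[i, i] are the usual counterfactual self-influence.
import numpy as np

def train_logreg(X, y, lr=0.5, steps=400):
    """Deterministic logistic regression via full-batch gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - y
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def sample_loss(w, b, x, y):
    """Binary cross-entropy loss of the model (w, b) on one sample."""
    p = np.clip(1.0 / (1.0 + np.exp(-(x @ w + b))), 1e-12, 1 - 1e-12)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def full_influence_matrix(X, y):
    """Exact leave-one-out influence of every sample j on every sample i."""
    n = len(y)
    w_full, b_full = train_logreg(X, y)
    base = np.array([sample_loss(w_full, b_full, X[i], y[i]) for i in range(n)])
    infl = np.zeros((n, n))
    for j in range(n):
        mask = np.arange(n) != j
        w, b = train_logreg(X[mask], y[mask])
        for i in range(n):
            infl[i, j] = sample_loss(w, b, X[i], y[i]) - base[i]
    return infl

# Toy data: two Gaussian blobs, plus a near-duplicate of sample 0 to
# illustrate how a near-duplicate can suppress self-influence (removing
# sample 0 matters little while its near-copy remains in training).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.5, 1.0, size=(6, 2)),
               rng.normal(+1.5, 1.0, size=(6, 2))])
y = np.array([0] * 6 + [1] * 6)
X = np.vstack([X, X[0] + 0.01])
y = np.append(y, y[0])

M = full_influence_matrix(X, y)
print(M.shape)  # → (13, 13)
```

Exact leave-one-out retraining, as here, is only feasible at toy scale; the paper's point is that inspecting full rows of this matrix, rather than the diagonal alone, is what exposes near-duplicate-driven memorization.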
Problem

Research questions and friction points this paper is trying to address.

Study memorization via counterfactual influence distribution
Analyze impact of near-duplicates on memorization risks
Compare self-influence vs full influence distribution effects
Innovation

Methods, ideas, or system contributions that make the work stand out.

Treats counterfactual influence as a distributional quantity
Computes the full pairwise influence distribution across training samples
Reveals memorization risks missed by self-influence alone