🤖 AI Summary
This work exposes the fragility of mainstream memorization-scoring methods, such as influence estimation, under adversarial settings. Exploiting only black-box model access, the authors propose an attack framework based on input-pseudoinverse computation and sensitivity-aware query optimization, giving the first systematic demonstration that memorization scores can be precisely manipulated even for high-accuracy models. Extensive experiments across diverse image classification tasks confirm that prevailing memorization proxy metrics are vulnerable to targeted manipulation, and a theoretical analysis characterizes their instability boundaries under adversarial perturbations. Together, these results uncover a critical security vulnerability in data attribution and call into question the robustness of influence estimation as a trustworthy explanatory tool.
📝 Abstract
Influence estimation tools -- such as memorization scores -- are widely used to understand model behavior, attribute training data, and inform dataset curation. However, recent applications in data valuation and responsible machine learning raise the question: can these scores themselves be adversarially manipulated? In this work, we present a systematic study of the feasibility of attacking memorization-based influence estimators. We characterize attacks that produce highly memorized samples as highly sensitive queries in the regime where the trained model is accurate. Our attack (computing the pseudoinverse of the input) is practical: it requires only black-box access to model outputs and incurs modest computational overhead. We empirically validate our attack across a wide suite of image classification tasks, showing that even state-of-the-art proxies are vulnerable to targeted score manipulation. In addition, we provide a theoretical analysis of the stability of memorization scores under adversarial perturbations, revealing conditions under which influence estimates are inherently fragile. Our findings highlight critical vulnerabilities in influence-based attribution and suggest the need for robust defenses. All code can be found at https://anonymous.4open.science/r/MemAttack-5413/
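To give a rough intuition for the pseudoinverse construction the abstract mentions, here is a minimal sketch, under the assumption of a local linear approximation of the model. All names, shapes, and the surrogate linear map below are illustrative, not the authors' actual code or method:

```python
import numpy as np

# Assumption: near a query point, the model behaves like a linear map
# f(x) ≈ W x.  The Moore-Penrose pseudoinverse W⁺ then gives the
# minimum-norm input that produces a desired output vector -- one way
# to construct a highly sensitive query.
rng = np.random.default_rng(0)
W = rng.standard_normal((10, 64))   # surrogate map: 64 features -> 10 logits

y_target = np.zeros(10)
y_target[3] = 5.0                   # desired logit pattern (illustrative)

# Minimum-norm pre-image of y_target under the linear approximation.
x_query = np.linalg.pinv(W) @ y_target

# For a wide, full-rank W, the crafted input exactly reproduces the
# target outputs under the linear map.
print(np.allclose(W @ x_query, y_target))
```

Because `W` here is wide and full rank, `W @ pinv(W)` is the identity, so the crafted query hits the target outputs exactly; for a real network this only holds to first order around the linearization point.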