🤖 AI Summary
This work exposes the fragility of mainstream memorization-scoring methods, such as influence estimation, under adversarial settings. Exploiting only black-box model access, the authors propose an attack framework based on input-pseudoinverse computation and sensitivity-aware query optimization, giving the first systematic demonstration that memorization scores can be precisely manipulated even for high-accuracy models. Extensive experiments across diverse image classification tasks confirm that prevailing memorization proxy metrics are vulnerable to targeted manipulation, and a theoretical analysis characterizes their instability boundaries under adversarial perturbations. Together, these results uncover a critical security vulnerability in data attribution and call into question the robustness of influence estimation as a trustworthy explanatory tool.
📝 Abstract
Influence estimation tools -- such as memorization scores -- are widely used to understand model behavior, attribute training data, and inform dataset curation. However, recent applications in data valuation and responsible machine learning raise the question: can these scores themselves be adversarially manipulated? In this work, we present a systematic study of the feasibility of attacking memorization-based influence estimators. We characterize attacks that produce highly memorized samples as highly sensitive queries in the regime where the trained model is accurate. Our attack (computing the pseudoinverse of the input) is practical: it requires only black-box access to model outputs and incurs modest computational overhead. We empirically validate our attack across a wide suite of image classification tasks, showing that even state-of-the-art proxies are vulnerable to targeted score manipulation. In addition, we provide a theoretical analysis of the stability of memorization scores under adversarial perturbations, revealing conditions under which influence estimates are inherently fragile. Our findings highlight critical vulnerabilities in influence-based attribution and suggest the need for robust defenses. All code can be found at https://anonymous.4open.science/r/MemAttack-5413/
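To give a rough intuition for the pseudoinverse construction the abstract mentions, here is a minimal sketch, under the assumption of a local linear approximation of the model. All names, shapes, and the surrogate linear map below are illustrative, not the authors' actual code or method:

```python
import numpy as np

# Assumption: near a query point, the model behaves like a linear map
# f(x) ≈ W x.  The Moore-Penrose pseudoinverse W⁺ then gives the
# minimum-norm input that produces a desired output vector -- one way
# to construct a highly sensitive query.
rng = np.random.default_rng(0)
W = rng.standard_normal((10, 64))   # surrogate map: 64 features -> 10 logits

y_target = np.zeros(10)
y_target[3] = 5.0                   # desired logit pattern (illustrative)

# Minimum-norm pre-image of y_target under the linear approximation.
x_query = np.linalg.pinv(W) @ y_target

# For a wide, full-rank W, the crafted input exactly reproduces the
# target outputs under the linear map.
print(np.allclose(W @ x_query, y_target))
```

Because `W` here is wide and full rank, `W @ pinv(W)` is the identity, so the crafted query hits the target outputs exactly; for a real network this only holds to first order around the linearization point.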