🤖 AI Summary
Machine learning models deployed in high-stakes decision-making domains, such as credit scoring and hiring, often neglect the malleability of individual features, leading to safety failures when individuals strategically change their inputs.
Method: We propose a statistical inference framework for verifying the responsiveness of model predictions to controlled feature interventions. Specifically, we formalize responsiveness as intervention-aware sensitivity analysis, in which practitioners specify constraints over interventions and distributions over their downstream effects; the framework then supports unbiased estimation of response probabilities and failure-risk quantification via uniform black-box sampling over the reachable set of feature states.
Contribution/Results: This work introduces a formal, model-agnostic validation procedure for prediction responsiveness, supporting falsification and failure probability estimation. Empirical validation across recidivism prediction, organ transplant prioritization, and content moderation demonstrates that these estimates can surface behavior-induced safety failures.
📝 Abstract
Many safety failures in machine learning arise when models are used to assign predictions to people (often in settings like lending, hiring, or content moderation) without accounting for how individuals can change their inputs. In this work, we introduce a formal validation procedure for the responsiveness of predictions with respect to interventions on their features. Our procedure frames responsiveness as a type of sensitivity analysis in which practitioners control a set of changes by specifying constraints over interventions and distributions over downstream effects. We describe how to estimate responsiveness for the predictions of any model and any dataset using only black-box access, and how to use these estimates to support tasks such as falsification and failure probability estimation. We develop algorithms that construct these estimates by generating a uniform sample of reachable points, and demonstrate how they can promote safety in real-world applications such as recidivism prediction, organ transplant prioritization, and content moderation.
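The estimation procedure described above can be sketched in code. The following is a minimal illustration, not the paper's actual algorithm: it assumes a binary classifier with black-box access and models the feasible intervention set as a simple per-feature box (the paper supports richer constraints and distributions over downstream effects). The function name `estimate_responsiveness` and all parameters are hypothetical. It draws a uniform sample of reachable points, queries the model on each, and reports the estimated response probability together with one counterexample point for falsification, if any prediction flips.

```python
import numpy as np

def estimate_responsiveness(predict, x, action_lb, action_ub,
                            n_samples=10_000, seed=0):
    """Monte Carlo estimate of responsiveness at a single point x.

    predict   : black-box function mapping a 2-D array of points to labels
    action_lb : per-feature lower bounds on interventions (lb == ub == 0
                means the feature is immutable)
    action_ub : per-feature upper bounds on interventions

    Samples interventions uniformly over the box [action_lb, action_ub],
    then returns the fraction of reachable points whose prediction differs
    from predict(x), plus one witness point if any prediction flipped.
    """
    rng = np.random.default_rng(seed)
    # Uniform sample of interventions over the feasible action box.
    actions = rng.uniform(action_lb, action_ub, size=(n_samples, len(x)))
    reachable = x + actions
    # Black-box queries only: no access to model internals is needed.
    y0 = predict(x[None, :])[0]
    flipped = predict(reachable) != y0
    p_hat = flipped.mean()  # unbiased estimate of the response probability
    witness = reachable[np.argmax(flipped)] if flipped.any() else None
    return p_hat, witness

# Usage: a toy linear threshold model over two features, where only the
# first feature is actionable (e.g., income) and the second is immutable.
predict = lambda X: (X[:, 0] + 0.5 * X[:, 1] >= 1.0).astype(int)
x = np.array([0.6, 0.4])
p_hat, witness = estimate_responsiveness(
    predict, x,
    action_lb=np.array([0.0, 0.0]),
    action_ub=np.array([0.5, 0.0]),
)
```

In this toy example the prediction flips exactly when the sampled action on the first feature is at least 0.2, so the true response probability is 0.6 and `p_hat` concentrates near it; `witness` is a concrete reachable point that falsifies the claim that the prediction is unresponsive.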