🤖 AI Summary
This study addresses a widespread problem in AI red-teaming evaluations: comparisons of attack success rates (ASR) often rest on invalid or incomparable measurements, leading to erroneous conclusions about system safety or attack efficacy. The paper brings measurement validity theory from the social sciences into AI red teaming and, drawing on inferential statistics and illustrative case studies such as jailbreaking attacks, systematically analyzes the conditions under which ASR comparisons are meaningful. It establishes clear prerequisites for valid ASR comparisons, identifies and categorizes common patterns of invalid comparison, and thereby provides a theoretical foundation for more rigorous and comparable AI safety evaluations.
📝 Abstract
We argue that conclusions drawn about relative system safety or attack method efficacy via AI red teaming are often not supported by the evidence that attack success rate (ASR) comparisons provide. We show, through conceptual, theoretical, and empirical contributions, that many such conclusions rest on apples-to-oranges comparisons or low-validity measurements. Our arguments are grounded in a simple question: When can attack success rates be meaningfully compared? To answer it, we draw on ideas from social science measurement theory and inferential statistics, which together provide a conceptual grounding for understanding when numerical values obtained by quantifying system attributes can be meaningfully compared. Through this lens, we articulate conditions under which ASRs can and cannot be meaningfully compared. Using jailbreaking as a running example, we illustrate and discuss at length apples-to-oranges ASR comparisons and measurement validity challenges.
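To make the inferential-statistics half of this question concrete, here is a minimal sketch (not the paper's method, and with made-up success counts) of the basic sanity check that must precede any ASR comparison: putting a confidence interval on each rate and testing whether the observed gap is distinguishable from sampling noise. As the paper argues, passing such a test is necessary but far from sufficient; the two ASRs must also be valid measurements of comparable things (same threat model, success criterion, and prompt distribution) for the comparison to be anything other than apples-to-oranges.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial proportion (an ASR)."""
    if trials == 0:
        raise ValueError("ASR is undefined with zero attack attempts")
    p = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2))
    return max(0.0, center - margin), min(1.0, center + margin)

def compare_asrs(succ_a: int, n_a: int, succ_b: int, n_b: int) -> tuple[float, float, float]:
    """Two-proportion z-test: can the ASR gap be told apart from sampling noise?"""
    p_a, p_b = succ_a / n_a, succ_b / n_b
    pooled = (succ_a + succ_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF, via math.erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p_a, p_b, p_value

# Hypothetical counts: attack A succeeds on 30/100 prompts, attack B on 18/100.
p_a, p_b, p_value = compare_asrs(30, 100, 18, 100)
print(f"ASR_A = {p_a:.2f}, 95% CI {wilson_interval(30, 100)}")
print(f"ASR_B = {p_b:.2f}, 95% CI {wilson_interval(18, 100)}")
print(f"two-sided p-value for the difference: {p_value:.3f}")
```

With these illustrative numbers the intervals overlap substantially and the p-value lands near 0.05, showing how an ASR gap that looks decisive in a headline table (30% vs. 18%) can be barely distinguishable from noise at 100 attempts per attack, even before any measurement validity concerns enter the picture.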