Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?

📅 2024-07-31
🏛️ arXiv.org
📈 Citations: 5
Influential: 0
📄 PDF
🤖 AI Summary
Current AI safety evaluations suffer from "safetywashing": the conflation of safety performance with general capabilities (e.g., knowledge, reasoning), which leads to misleading claims of safety progress. Method: We propose an empirically grounded separability criterion: meaningful safety metrics should be empirically separable from (i.e., not highly correlated with) general capabilities. Using cross-benchmark correlation analysis across dozens of models (with correlations exceeding ρ = 0.9 for some benchmarks), analysis of correlations with training compute, and measurement of safety-capability correlations, we conduct a systematic meta-analysis of AI safety benchmarks. Results: We find that many mainstream safety benchmarks predominantly reflect improvements in general capabilities rather than genuine safety advances. Building on this, we propose an empirical foundation for the science of safety evaluation, providing a methodological framework and evidence for designing separable, reproducible, and goal-directed safety assessment protocols.
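The core analysis is simple to sketch. Below is a minimal illustration of this style of correlation analysis, assuming a hypothetical models × benchmarks score matrix; the variable names and synthetic data are placeholders, not the paper's actual benchmarks or results. The idea: summarize capability benchmarks into a single general-capabilities score per model (here via a first principal component), then check how strongly a candidate safety benchmark correlates with it.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical data: 30 models scored on 5 capability benchmarks
# (knowledge, reasoning, etc.) and one candidate safety benchmark.
capability_scores = rng.random((30, 5))
safety_scores = rng.random(30)

# Summarize general capabilities as the first principal component
# of the (mean-centered) capability score matrix.
X = capability_scores - capability_scores.mean(axis=0)
_, _, vt = np.linalg.svd(X, full_matrices=False)
capabilities_component = X @ vt[0]  # one scalar score per model

# Spearman correlation between the safety benchmark and the
# capabilities component; |rho| near 1 suggests the benchmark
# mostly tracks general capabilities, not a distinct safety goal.
rho, p = spearmanr(capabilities_component, safety_scores)
print(f"capabilities correlation: rho = {rho:.2f} (p = {p:.3f})")
```

Under the paper's criterion, a safety benchmark whose scores track this capabilities component almost perfectly is measuring capabilities, and gains on it should not be advertised as safety progress.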

📝 Abstract
As artificial intelligence systems grow more powerful, there has been increasing interest in "AI safety" research to address emerging and future risks. However, the field of AI safety remains poorly defined and inconsistently measured, leading to confusion about how researchers can contribute. This lack of clarity is compounded by the unclear relationship between AI safety benchmarks and upstream general capabilities (e.g., general knowledge and reasoning). To address these issues, we conduct a comprehensive meta-analysis of AI safety benchmarks, empirically analyzing their correlation with general capabilities across dozens of models and providing a survey of existing directions in AI safety. Our findings reveal that many safety benchmarks highly correlate with both upstream model capabilities and training compute, potentially enabling "safetywashing" -- where capability improvements are misrepresented as safety advancements. Based on these findings, we propose an empirical foundation for developing more meaningful safety metrics and define AI safety in a machine learning research context as a set of clearly delineated research goals that are empirically separable from generic capabilities advancements. In doing so, we aim to provide a more rigorous framework for AI safety research, advancing the science of safety evaluations and clarifying the path towards measurable progress.

Problem

Research questions and friction points this paper is trying to address.

AI safety
measurement inconsistency
capability misrepresentation

Innovation

Methods, ideas, or system contributions that make the work stand out.

AI safety benchmarks
empirical foundation
rigorous framework