AI Summary
This work addresses the 'safety gap': a critical phenomenon wherein open-weight large language models (LLMs) exhibit a sharp increase in hazardous capabilities once safety mitigations are removed. We propose a formal, quantitative definition of the safety gap and introduce a multidimensional evaluation toolkit encompassing biochemical and cyberattack capability benchmarks, refusal-rate analysis, and generation-quality assessment. Through systematic experiments on the Llama-3 and Qwen-2.5 families across multiple parameter scales, we demonstrate that the safety gap grows significantly with model size and that existing alignment mechanisms are readily circumvented: diverse safeguard-removal techniques consistently induce dramatic surges in high-risk capabilities. Our study is the first to empirically establish the scale-dependent nature of the safety gap, thereby advancing tamper-resistant safety evaluation paradigms. The evaluation toolkit is publicly released to foster community-driven research and benchmarking.
Abstract
Open-weight large language models (LLMs) unlock substantial benefits in innovation, personalization, privacy, and democratization. However, their core advantage, modifiability, opens the door to systemic risks: bad actors can trivially subvert current safeguards, turning beneficial models into tools for harm. This leads to a 'safety gap': the difference in dangerous capabilities between a model with intact safeguards and one that has been stripped of those safeguards. We open-source a toolkit to estimate the safety gap for state-of-the-art open-weight models. As a case study, we evaluate biochemical and cyber capabilities, refusal rates, and generation quality of models from two families (Llama-3 and Qwen-2.5) across a range of parameter scales (0.5B to 405B) using different safeguard removal techniques. Our experiments reveal that the safety gap widens as model scale increases and that effective dangerous capabilities grow substantially when safeguards are removed. We hope that the Safety Gap Toolkit (https://github.com/AlignmentResearch/safety-gap) will serve as an evaluation framework for common open-source models and as a motivation for developing and testing tamper-resistant safeguards. We welcome contributions to the toolkit from the community.
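The safety gap defined above can be expressed as a simple difference of benchmark scores. The sketch below illustrates that arithmetic; the function and variable names are illustrative assumptions, not the Safety Gap Toolkit's actual API.

```python
# Hypothetical sketch of the safety-gap metric: the increase in a dangerous-
# capability score after safeguards are removed from the same base model.
# Names here are illustrative only (not the toolkit's real interface).

def dangerous_capability_score(correct: int, total: int) -> float:
    """Fraction of hazardous benchmark questions answered correctly."""
    return correct / total

def safety_gap(score_safeguarded: float, score_stripped: float) -> float:
    """Capability gained by safeguard removal; positive means a gap exists."""
    return score_stripped - score_safeguarded

# Example: with safeguards intact the model answers 12/100 hazardous
# questions (mostly refusing); after safeguard removal it answers 68/100.
gap = safety_gap(dangerous_capability_score(12, 100),
                 dangerous_capability_score(68, 100))
print(round(gap, 2))
```

In practice the toolkit's case study pairs such capability scores with refusal-rate and generation-quality measurements, since a stripped model that refuses less but generates incoherently would overstate the effective risk.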