AI Summary
This work addresses the 'safety gap': a critical phenomenon wherein open-weight large language models (LLMs) exhibit a sharp increase in hazardous capabilities once safety mitigations are removed. We propose a formal, quantitative definition of the safety gap and introduce a multidimensional evaluation toolkit encompassing biochemical and cyberattack capability benchmarks, refusal-rate analysis, and generation-quality assessment. Through systematic experiments on the Llama-3 and Qwen-2.5 families across multiple parameter scales, we demonstrate that the safety gap grows significantly with model size and that existing alignment mechanisms are readily circumvented: diverse safeguard-removal techniques consistently induce dramatic surges in high-risk capabilities. Our study is the first to empirically establish the scale-dependent nature of the safety gap, thereby advancing tamper-resistant safety evaluation paradigms. The evaluation toolkit is publicly released to foster community-driven research and benchmarking.
Abstract
Open-weight large language models (LLMs) unlock substantial benefits in innovation, personalization, privacy, and democratization. However, their core advantage, modifiability, opens the door to systemic risks: bad actors can trivially subvert current safeguards, turning beneficial models into tools for harm. This leads to a 'safety gap': the difference in dangerous capabilities between a model with intact safeguards and one that has been stripped of those safeguards. We open-source a toolkit to estimate the safety gap for state-of-the-art open-weight models. As a case study, we evaluate biochemical and cyber capabilities, refusal rates, and generation quality of models from two families (Llama-3 and Qwen-2.5) across a range of parameter scales (0.5B to 405B) using different safeguard removal techniques. Our experiments reveal that the safety gap widens as model scale increases and that effective dangerous capabilities grow substantially when safeguards are removed. We hope that the Safety Gap Toolkit (https://github.com/AlignmentResearch/safety-gap) will serve as an evaluation framework for common open-source models and as a motivation for developing and testing tamper-resistant safeguards. We welcome contributions to the toolkit from the community.
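The safety gap defined above can be expressed as a simple difference of benchmark scores. The sketch below illustrates that arithmetic; the function and variable names are illustrative assumptions, not the Safety Gap Toolkit's actual API.

```python
# Hypothetical sketch of the safety-gap metric: the increase in a dangerous-
# capability score after safeguards are removed from the same base model.
# Names here are illustrative only (not the toolkit's real interface).

def dangerous_capability_score(correct: int, total: int) -> float:
    """Fraction of hazardous benchmark questions answered correctly."""
    return correct / total

def safety_gap(score_safeguarded: float, score_stripped: float) -> float:
    """Capability gained by safeguard removal; positive means a gap exists."""
    return score_stripped - score_safeguarded

# Example: with safeguards intact the model answers 12/100 hazardous
# questions (mostly refusing); after safeguard removal it answers 68/100.
gap = safety_gap(dangerous_capability_score(12, 100),
                 dangerous_capability_score(68, 100))
print(round(gap, 2))
```

In practice the toolkit's case study pairs such capability scores with refusal-rate and generation-quality measurements, since a stripped model that refuses less but generates incoherently would overstate the effective risk.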