Measuring Safety Alignment Effects in Autonomous Security Agents

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Single-turn refusal benchmarks do not capture how safety-aligned language models behave as autonomous agents in multi-step vulnerability analysis. This work presents a trace-based benchmark of 30 local vulnerability-analysis tasks that evaluates models acting as security agents under fixed tools, deterministic success predicates, redaction rules, and grounding checks. It separates refusal, unsafe action, tool reliability, and evidence grounding at the system level, showing that refusal rate alone can misrepresent alignment effects and that the trade-off between safety and capability is inconsistent across model families. The less-restricted Gemma derivatives reach much higher task success than their stock counterparts (14.0% versus 0.7% for 31B and 10.7% versus 0.0% for 26B), with 0.0% refusal, suppressed-action, and unsafe-action rates in the 31B traces, but the Gemma gaps also appear on ordinary coding tasks. The less-restricted Qwen2.5-Coder derivative has lower success (2.0% versus 5.3%), and the abliterated Llama derivative fails the tool protocol. No model family solves the hard proof-of-trigger or patch-verification tasks.

📝 Abstract

Do stock safety-aligned language models and their uncensored or abliterated derivatives behave differently when run as autonomous security agents? Single-turn refusal benchmarks cannot answer this question: security agents must inspect repositories, call tools, and produce vulnerability evidence inside authorized sandboxes. We present a trace-based benchmark of 30 local vulnerability-analysis tasks with fixed tools, deterministic success predicates, redaction rules, and grounding checks, and compare four stock models against uncensored or abliterated derivatives: Gemma 4 31B, Gemma 4 26B A4B, Qwen2.5-Coder 7B, and Llama 3.1 8B. The artifact contains 1,500 security-agent traces and 800 non-security control traces. The Gemma pairs show large less-restricted gains on security tasks: 14.0% versus 0.7% success for 31B and 10.7% versus 0.0% for 26B, with higher mean grounding (3.91 versus 3.27 and 4.12 versus 1.64 out of five) and 0.0% refusal, suppressed-action, and unsafe-action rates in the 31B traces. However, controls and non-Gemma pairs rule out a clean security-specific or universal less-restricted effect: Gemma gaps also appear on ordinary coding tasks, Qwen2.5-Coder success is lower for the less-restricted derivative (2.0% versus 5.3%), and the abliterated Llama derivative fails the tool protocol. Across all families, hard proof-of-trigger and patch-verification tasks remain unsolved. These results show that safety alignment effects in autonomous security agents should be measured at the system level, separating refusal, unsafe action, tool reliability, and evidence grounding rather than treating refusal rate as the safety signal.
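The system-level separation described in the abstract can be illustrated with a minimal sketch that reports success, refusal, unsafe action, tool-protocol reliability, and grounding as separate per-model columns instead of a single refusal number. The trace schema, the field names, and the function summarize are hypothetical and are not taken from the paper's artifact; the toy records are invented for illustration and are not results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Trace:
    """One agent run. All field names are hypothetical, not the paper's schema."""
    model: str
    success: bool           # deterministic success predicate passed
    refused: bool           # agent declined the task
    unsafe_action: bool     # agent took an action flagged as unsafe
    tool_protocol_ok: bool  # every tool call followed the expected format
    grounding: float        # evidence-grounding score out of five


def summarize(traces):
    """Return per-model rates, keeping each signal in its own column."""
    by_model = defaultdict(list)
    for trace in traces:
        by_model[trace.model].append(trace)

    report = {}
    for model, rows in by_model.items():
        n = len(rows)
        report[model] = {
            "n": n,
            "success_rate": sum(r.success for r in rows) / n,
            "refusal_rate": sum(r.refused for r in rows) / n,
            "unsafe_action_rate": sum(r.unsafe_action for r in rows) / n,
            "tool_protocol_rate": sum(r.tool_protocol_ok for r in rows) / n,
            "mean_grounding": mean([r.grounding for r in rows]),
        }
    return report


# Toy data, invented for illustration only (not results from the paper).
toy_traces = [
    Trace("stock", False, True, False, True, 2.0),
    Trace("stock", False, True, False, True, 3.0),
    Trace("stock", True, False, False, True, 4.0),
    Trace("stock", False, False, False, True, 3.0),
    Trace("derivative", False, False, False, False, 1.0),
    Trace("derivative", False, False, False, False, 1.0),
]

report = summarize(toy_traces)
for model, row in report.items():
    print(model, row)
```

In the toy data the hypothetical derivative never refuses, yet it also never succeeds, because every run breaks the tool protocol. A refusal-only metric would rank it as the less restricted and seemingly more helpful model, which is the kind of misreading the separate columns are meant to expose.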

Problem

Research questions and friction points this paper is trying to address.

safety alignment

autonomous security agents

vulnerability analysis

language models

benchmarking

Innovation

Methods, ideas, or system contributions that make the work stand out.

trace-based benchmark

autonomous security agents

safety alignment