Position: AI Security Policy Should Target Systems, Not Models

📅 2026-05-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current AI safety regulation overly emphasizes individual models while neglecting systemic risks. This work presents swarm-attack, an open-source adversarial testing framework in which multiple lightweight, openly available language models coordinate through shared memory, parallel exploration, and evolutionary optimization to bypass the safety measures of state-of-the-art large models at low cost and to discover vulnerabilities in C programs. The approach demonstrates that high-order security threats stem not from the capability of any single model but from multi-agent coordination and toolchain integration. In the experiments, five instances of a 1.2B-parameter model achieve a 45.8% Effective Harm Rate against GPT-4o, producing 49 critical-severity breaches, while the Effective Harm Rate against Claude Sonnet 4 is 0%. When combined with a hand-crafted seed corpus, regex pattern detection, and AddressSanitizer-based crash classification, the system recovers all nine planted CWE vulnerabilities (100% recall) in approximately four minutes on a consumer MacBook.

📝 Abstract

We present swarm-attack, an open-source adversarial testing framework in which multiple lightweight LLM agents coordinate through shared memory, parallel exploration, and evolutionary optimization. Together, our results demonstrate that both safety bypass of frontier models and software vulnerability discovery, i.e., the capability class that motivated restricted release of Anthropic's Mythos Preview, are achievable at effectively zero cost using commodity hardware and openly available models. We report two experiments. In the first, five instances of a 1.2 billion parameter model conducted 225 jailbreak attacks each against GPT-4o and Claude Sonnet 4. Against GPT-4o, the swarm achieved an Effective Harm Rate of 45.8%, producing 49 critical-severity breaches; against Claude Sonnet 4, the Effective Harm Rate was 0% despite a 40% technical success rate. In the second experiment, the same models performed combined source code analysis and binary fuzzing against a vulnerable C application with 9 planted CWEs. With a hand-crafted exploit seed corpus, regex pattern detection, and AddressSanitizer-based crash classification, the pipeline recovers 9 of 9 vulnerabilities (100% recall) in approximately four minutes on a consumer MacBook. With those scaffold components disabled, the same model recovers 0 of 9 by crash verification and 2 of 9 by citation. That capability class is therefore reproducible at effectively zero cost; the key enabler is the system scaffold itself, which compensates for the limited reasoning capacity of small individual models.
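To make the scaffold's role in the second experiment more concrete, the following is a minimal sketch of what AddressSanitizer-based crash classification and the recall calculation could look like. It is not the paper's implementation. The `ASAN_TO_CWE` mapping, the function names, and the example CWE identifiers are assumptions chosen for illustration, since the abstract does not list the nine planted CWEs. The only detail taken from AddressSanitizer itself is the report header format (`ERROR: AddressSanitizer: <bug-type> ...`).

```python
from typing import Optional, Set

# Hypothetical mapping from AddressSanitizer bug types to CWE IDs.
# Assumption for illustration: the abstract does not say which CWEs were planted.
ASAN_TO_CWE = {
    "heap-buffer-overflow": "CWE-122",
    "stack-buffer-overflow": "CWE-121",
    "heap-use-after-free": "CWE-416",
    "attempting double-free": "CWE-415",
}


def classify_crash(asan_report: str) -> Optional[str]:
    """Return the CWE for the bug type named in an ASan report, or None."""
    for bug_type, cwe in ASAN_TO_CWE.items():
        # ASan reports open with "==PID==ERROR: AddressSanitizer: <bug-type> ..."
        if f"AddressSanitizer: {bug_type}" in asan_report:
            return cwe
    return None


def recall(found: Set[str], planted: Set[str]) -> float:
    """Fraction of planted vulnerabilities that were recovered."""
    return len(found & planted) / len(planted)


if __name__ == "__main__":
    report = (
        "==4242==ERROR: AddressSanitizer: heap-buffer-overflow "
        "on address 0x602000000014 at pc 0x000100003f2c"
    )
    print(classify_crash(report))
    # Placeholder IDs standing in for nine planted vulnerabilities.
    planted = {f"V{i}" for i in range(1, 10)}
    print(f"{recall(planted, planted):.0%}")        # scaffold on: 9 of 9
    print(f"{recall({'V1', 'V2'}, planted):.1%}")   # 2 of 9 by citation
```

Under this reading, the 9 of 9 result corresponds to a recall of 1.0, and the ablated runs correspond to 0/9 = 0 by crash verification and 2/9 ≈ 0.22 by citation.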

Problem

Research questions and friction points this paper is trying to address.

AI security policy

adversarial testing

LLM swarm

safety bypass

vulnerability discovery

Innovation

Methods, ideas, or system contributions that make the work stand out.

swarm-attack

adversarial testing

LLM agents