AutoAdvExBench: Benchmarking autonomous exploitation of adversarial example defenses

📅 2025-03-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the lack of realistic evaluation of large language models' (LLMs) offensive capabilities in adversarial machine learning. It introduces the first benchmark for autonomously attacking real-world adversarial-example defenses, grounded in the actual workflows of machine learning security experts rather than proxy safety evaluations. Unlike prior work, it explicitly distinguishes CTF-style pedagogical defenses from authentic defense code, and the accompanying agent combines autonomous tool use, domain knowledge of attacks and defenses, dynamic analysis of defense code, and multi-step reasoning. Experiments show that the strongest agent breaks 75% of CTF-like defenses but only 13% of real-world defenses; a more capable model raises real-world success to just 21%, exposing a substantial gap between attacking pedagogical code and real code. This disparity suggests the benchmark is a discriminative, rigorous measure of LLM security competence.

📝 Abstract
We introduce AutoAdvExBench, a benchmark to evaluate if large language models (LLMs) can autonomously exploit defenses to adversarial examples. Unlike existing security benchmarks that often serve as proxies for real-world tasks, AutoAdvExBench directly measures LLMs' success on tasks regularly performed by machine learning security experts. This approach offers a significant advantage: if an LLM could solve the challenges presented in AutoAdvExBench, it would immediately present practical utility for adversarial machine learning researchers. We then design a strong agent that is capable of breaking 75% of CTF-like ("homework exercise") adversarial example defenses. However, we show that this agent is only able to succeed on 13% of the real-world defenses in our benchmark, indicating the large gap between the difficulty of attacking "real" code and CTF-like code. In contrast, a stronger LLM that can attack 21% of real defenses only succeeds on 54% of CTF-like defenses. We make this benchmark available at https://github.com/ethz-spylab/AutoAdvExBench.
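To make "exploiting a defense" concrete: the attacks the benchmark expects agents to produce are typically gradient-based perturbations of the defended model's input. The sketch below is a minimal, illustrative one-step fast-gradient-sign (FGSM-style) attack against a toy logistic-regression classifier; it is not the paper's code, and the model, function name, and parameters are all assumptions for illustration.

```python
import numpy as np

def fgsm_attack(x, w, b, y, eps):
    """One-step FGSM-style attack on a toy logistic-regression model
    (illustrative only, not from AutoAdvExBench).

    Perturbs input x by eps in the direction that increases the
    cross-entropy loss: x_adv = x + eps * sign(dL/dx).
    """
    z = float(np.dot(w, x) + b)
    p = 1.0 / (1.0 + np.exp(-z))   # predicted P(y = 1)
    grad_x = (p - y) * w           # gradient of cross-entropy loss w.r.t. x
    return x + eps * np.sign(grad_x)

# Toy example: a point correctly classified as class 1.
w = np.array([2.0, -1.0])
b = 0.0
x = np.array([1.0, 0.5])           # w.x + b = 1.5 > 0 -> class 1
x_adv = fgsm_attack(x, w, b, y=1, eps=1.0)
print(np.dot(w, x_adv) + b)        # -> -1.5: the perturbation flips the label
```

Real defenses complicate exactly this step, e.g. by masking gradients or preprocessing inputs, which is why attacking "real" defense code requires the code analysis and multi-step reasoning the benchmark measures.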
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLMs' ability to exploit adversarial example defenses.
Measures LLMs' success on real-world machine learning security tasks.
Highlights the gap in attack difficulty between CTF-like and real-world defense code.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark evaluates LLMs on attacking adversarial example defenses.
Agent breaks 75% of CTF-like defenses but only 13% of real-world ones.
A stronger LLM attacks 21% of real defenses but only 54% of CTF-like ones.