Measuring AI Agents' Progress on Multi-Step Cyber Attack Scenarios

📅 2026-03-11
🤖 AI Summary
This study systematically evaluates the autonomous offensive capabilities of state-of-the-art AI models in complex, multi-stage cyberattacks that require chaining heterogeneous action sequences. Two purpose-built cyber ranges (a 32-step enterprise network attack and a 7-step industrial control system attack) were constructed, and seven large language models released over an 18-month period were tested under varying inference-time compute budgets. The work introduces the first automated evaluation framework based on multi-step attack ranges, combining heterogeneous attack-action modeling with high-compute inference scheduling to separately quantify the effects of model-generation advances and inference compute on long-chain attack performance. Results show that model performance scales log-linearly with inference compute: the latest models complete an average of 9.8 of 32 steps (peaking at 22) under a 10M-token budget and are the first to show partially stable execution in the industrial control scenario (averaging 1.2 to 1.4 of 7 steps).

📝 Abstract
We evaluate the autonomous cyber-attack capabilities of frontier AI models on two purpose-built cyber ranges (a 32-step corporate network attack and a 7-step industrial control system attack) that require chaining heterogeneous capabilities across extended action sequences. By comparing seven models released over an eighteen-month period (August 2024 to February 2026) at varying inference-time compute budgets, we observe two capability trends. First, model performance scales log-linearly with inference-time compute, with no observed plateau: increasing from 10M to 100M tokens yields gains of up to 59%, requiring no specific technical sophistication from the operator. Second, each successive model generation outperforms its predecessor at fixed token budgets: on the corporate network range, average steps completed at 10M tokens rose from 1.7 (GPT-4o, August 2024) to 9.8 (Opus 4.6, February 2026). The best single run completed 22 of 32 steps, corresponding to roughly 6 of the estimated 14 hours a human expert would need. On the industrial control system range, performance remains limited, though the most recent models are the first to reliably complete steps, averaging 1.2 to 1.4 of 7 (max 3).
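The log-linear scaling claim can be made concrete with a small fit: under such a law, each 10x increase in token budget adds a roughly constant number of completed steps. The sketch below fits steps = a + b * log10(tokens) to illustrative data points; only the 9.8-steps-at-10M-tokens anchor comes from the abstract, and the other two points are assumptions chosen to reproduce the reported ~59% gain from 10M to 100M tokens, not data from the paper.

```python
import math

# (tokens, avg steps completed). Only (1e7, 9.8) is from the abstract;
# the other points are hypothetical, chosen so 10M -> 100M gains ~59%.
data = [(1e6, 4.0), (1e7, 9.8), (1e8, 15.6)]

# Least-squares fit of steps = a + b * log10(tokens).
xs = [math.log10(t) for t, _ in data]
ys = [s for _, s in data]
n = len(data)
x_mean, y_mean = sum(xs) / n, sum(ys) / n
b = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
    sum((x - x_mean) ** 2 for x in xs)
a = y_mean - b * x_mean

# Slope b is the fixed number of steps gained per 10x compute.
print(f"steps gained per 10x compute: {b:.2f}")
gain = (a + b * 8) / (a + b * 7) - 1  # 10M -> 100M tokens
print(f"relative gain, 10M -> 100M tokens: {gain:.0%}")
```

With these assumed points the fit gives a slope of 5.8 steps per 10x compute and a 59% relative gain from 10M to 100M tokens, matching the trend the abstract describes.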
Problem

Research questions and friction points this paper addresses: AI agents, cyber attack, multi-step scenarios, autonomous capabilities, capability evaluation.

Innovation

Methods, ideas, or system contributions that make the work stand out: multi-step cyber attack, autonomous AI agents, inference-time compute scaling, cyber range evaluation, capability progression.
Linus Folkerts, AI Security Institute, United Kingdom
Will Payne, AI Security Institute, United Kingdom
Simon Inman, AI Security Institute, United Kingdom
Philippos Giavridis, AI Security Institute, United Kingdom
Joe Skinner, AI Security Institute, United Kingdom
Sam Deverett, AI Security Institute, United Kingdom
James Aung, AI Security Institute, United Kingdom
Ekin Zorer, AI Security Institute, United Kingdom
Michael Schmatz, AI Security Institute, United Kingdom
Mahmoud Ghanem
John Wilkinson, AI Security Institute, United Kingdom
Alan Steer, AI Security Institute, United Kingdom
Vy Hong, AI Security Institute, United Kingdom
Jessica Wang