🤖 AI Summary
Current large language model (LLM) agents lack systematic evaluation on realistic cybersecurity tasks that require multi-step reasoning and tool use. To address this gap, this work proposes DeepRed, the first CTF agent evaluation framework that supports partial-credit scoring. DeepRed enables fine-grained measurement of agent behavior on offensive tasks through isolated virtual target machines, integrated terminal tools, optional web search, execution trace logging, challenge-specific checkpoints, and automated log analysis. In an evaluation of ten commercially accessible LLMs on ten VM-based CTF challenges, the best-performing model completed only 35% of checkpoints on average, revealing significant limitations in current agents’ ability to discover non-standard vulnerabilities and adapt over longer horizons.
📝 Abstract
Large Language Model (LLM) agents are increasingly proposed for autonomous cybersecurity tasks, but their capabilities in realistic offensive settings remain poorly understood. We present DeepRed, an open-source benchmark for evaluating LLM-based agents on realistic Capture The Flag (CTF) challenges in isolated virtualized environments. DeepRed places an agent in a Kali attacker environment with terminal tools and optional web search, connected over a private network to a target challenge, and records full execution traces for analysis. To move beyond binary solved/unsolved outcomes, we introduce a partial-credit scoring method based on challenge-specific checkpoints derived from public writeups, together with an automated summarise-then-judge labelling pipeline for assigning checkpoint completion from logs. Using DeepRed, we benchmark ten commercially accessible LLMs on ten VM-based CTF challenges spanning different challenge categories. The results indicate that current agents remain limited: the best model achieves only 35% average checkpoint completion, performing strongest on common challenge types and weakest on tasks requiring non-standard discovery and longer-horizon adaptation.
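The partial-credit idea can be illustrated with a minimal sketch. The abstract does not state the exact aggregation rule, so the code below assumes that each challenge is scored as the fraction of its checkpoints reached and that a model's overall score is the unweighted mean of those fractions. The function names, challenge names, and checkpoint labels are hypothetical and are not taken from DeepRed.

```python
from statistics import mean


def checkpoint_completion(labels: dict[str, list[bool]]) -> dict[str, float]:
    """Return the fraction of checkpoints reached for each challenge.

    `labels` maps a challenge name to one boolean per checkpoint
    (True = the judge marked the checkpoint as completed).
    """
    scores = {}
    for challenge, flags in labels.items():
        if not flags:
            raise ValueError(f"challenge {challenge!r} has no checkpoints")
        scores[challenge] = sum(flags) / len(flags)
    return scores


def average_completion(labels: dict[str, list[bool]]) -> float:
    """Unweighted mean of per-challenge completion (an assumed aggregation)."""
    return mean(checkpoint_completion(labels).values())


# Hypothetical checkpoint labels for one model on two challenges.
run = {
    "web-challenge": [True, True, False, False],
    "privesc-challenge": [True, False, False, False],
}

per_challenge = checkpoint_completion(run)
overall = average_completion(run)
print(per_challenge)
print(f"average checkpoint completion: {overall:.1%}")
```

Under a binary solved/unsolved metric both hypothetical challenges above would score zero, whereas the checkpoint fractions distinguish an agent that got halfway from one that stalled after the first step.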