Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture The Flag Challenges

📅 2026-04-21

🤖 AI Summary

The capabilities of large language model (LLM) agents on realistic offensive cybersecurity tasks, which require multi-step reasoning and tool use, remain poorly understood. This work presents DeepRed, an open-source benchmark that evaluates LLM agents on Capture The Flag (CTF) challenges with partial-credit scoring. DeepRed places the agent in a Kali attacker environment with terminal tools and optional web search, connects it over a private network to an isolated virtualized target, and records full execution traces. Challenge-specific checkpoints derived from public writeups are then marked as completed or not by an automated summarise-then-judge pipeline that reads the logs. Across ten VM-based CTF challenges and ten commercially accessible LLMs, the best model completed only 35% of checkpoints on average, performing strongest on common challenge types and weakest on tasks that require non-standard discovery and longer-horizon adaptation.

📝 Abstract

Large Language Model (LLM) agents are increasingly proposed for autonomous cybersecurity tasks, but their capabilities in realistic offensive settings remain poorly understood. We present DeepRed, an open-source benchmark for evaluating LLM-based agents on realistic Capture The Flag (CTF) challenges in isolated virtualized environments. DeepRed places an agent in a Kali attacker environment with terminal tools and optional web search, connected over a private network to a target challenge, and records full execution traces for analysis. To move beyond binary solved/unsolved outcomes, we introduce a partial-credit scoring method based on challenge-specific checkpoints derived from public writeups, together with an automated summarise-then-judge labelling pipeline for assigning checkpoint completion from logs. Using DeepRed, we benchmark ten commercially accessible LLMs on ten VM-based CTF challenges spanning different challenge categories. The results indicate that current agents remain limited: the best model achieves only 35% average checkpoint completion, performing strongest on common challenge types and weakest on tasks requiring non-standard discovery and longer-horizon adaptation.
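
The partial-credit idea in the abstract can be illustrated with a minimal sketch. It assumes each challenge has a set of checkpoints already labelled as completed or not, and that a model's score is the unweighted mean of per-challenge completion fractions; the abstract does not specify the exact aggregation, so that choice, the function names, and the example checkpoint names are all hypothetical.

```python
from statistics import mean


def checkpoint_completion(labels: dict[str, bool]) -> float:
    """Fraction of one challenge's checkpoints labelled as completed."""
    if not labels:
        raise ValueError("a challenge needs at least one checkpoint")
    return sum(labels.values()) / len(labels)


def average_completion(per_challenge: dict[str, dict[str, bool]]) -> float:
    """Unweighted mean of per-challenge completion (an assumed aggregation)."""
    return mean(checkpoint_completion(labels) for labels in per_challenge.values())


def solved_rate(per_challenge: dict[str, dict[str, bool]]) -> float:
    """Binary baseline: a challenge counts only if every checkpoint is done."""
    return mean(1.0 if all(labels.values()) else 0.0
                for labels in per_challenge.values())


# Hypothetical labels for one agent run on two made-up challenges.
run = {
    "challenge_a": {"port_scan": True, "foothold": True,
                    "user_flag": False, "root_flag": False},
    "challenge_b": {"service_enum": True, "exploit": False},
}

partial = average_completion(run)  # credit for intermediate progress
binary = solved_rate(run)          # no credit unless fully solved
```

In this made-up run the agent solves neither challenge, so a binary metric reports no progress at all, while the checkpoint-based score still reflects that half of the intermediate steps were reached.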

Problem

Research questions and friction points this paper is trying to address.

LLM agents

cybersecurity

Capture The Flag

evaluation benchmark

partial-credit scoring

Innovation

Methods, ideas, or system contributions that make the work stand out.

partial-credit evaluation

LLM agents

Capture The Flag