CTFusion: A CTF-based Benchmark for LLM Agent Evaluation

📅 2026-05-12
🤖 AI Summary
Existing CTF benchmarks often reuse historical challenges, which exposes them to data contamination and cheating and undermines reliable evaluation of the cybersecurity capabilities of large language model (LLM) agents. This work proposes CTFusion, the first streaming evaluation framework grounded in live CTF competitions. It mitigates data leakage and limits interference with the competition by keeping agents independent within a single team account and by forwarding only the first correct flag per challenge. CTFusion is implemented as a Model Context Protocol (MCP) server on the CTFd platform, so it supports diverse LLM agents and a range of CTF events. Experiments with three LLMs, two agents, and five live CTF events show that conventional benchmarks can be unreliable for assessing LLM-based agents, whereas CTFusion offers a fairer and more dynamic assessment. The code is publicly released to support further research.
📝 Abstract
Recent advances in Large Language Models (LLMs) have enabled agentic systems for complex, multi-step tasks, and cybersecurity is emerging as a prominent application. To evaluate such agents, researchers widely adopt Capture The Flag (CTF) benchmarks. However, current CTF benchmarks reuse existing challenges, which exposes them to data contamination and potential cheating. Notably, we confirmed these issues in practice by integrating web search tools into an existing agent. To address these limitations, we present CTFusion, a streaming evaluation framework built on Live CTFs. CTFusion preserves per-agent independence under a single team account and reduces its impact on the competition by forwarding only the first correct flag per challenge. Moreover, we implement CTFusion as a Model Context Protocol (MCP) server on the widely used CTFd platform, which makes it broadly applicable to diverse CTF events and agent types. Through experiments with three LLMs, two agents, and five Live CTFs, we demonstrate that existing CTF benchmarks can be unreliable in assessing LLM-based agents, while CTFusion can serve as a robust solution for evaluating cybersecurity agents. We release CTFusion as open source to foster future research in this area.
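The first-correct-flag policy described in the abstract can be illustrated with a minimal Python sketch. This is not the CTFusion implementation: the class `FirstSolveGate`, its fields, and the `submit_upstream` callback are hypothetical names introduced here. The sketch also assumes that the live platform is the only correctness oracle, so submissions are forwarded until one is accepted, and that each challenge has a single exact-match flag that can then be cached and checked locally.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class FirstSolveGate:
    """Hypothetical gate between several agents and one team account."""

    # Assumed callback: submit_upstream(challenge_id, flag) returns True
    # when the live CTF platform accepts the flag.
    submit_upstream: Callable[[str, str], bool]
    # Flag accepted upstream for each challenge already solved by the team.
    accepted_flags: Dict[str, str] = field(default_factory=dict)
    # Challenges each agent has solved, tracked independently per agent.
    solves_by_agent: Dict[str, Set[str]] = field(default_factory=dict)

    def submit(self, agent_id: str, challenge_id: str, flag: str) -> bool:
        if challenge_id in self.accepted_flags:
            # Already solved upstream: judge locally, forward nothing.
            correct = flag == self.accepted_flags[challenge_id]
        else:
            # Not yet solved: the platform decides (assumption, see lead-in).
            correct = self.submit_upstream(challenge_id, flag)
            if correct:
                self.accepted_flags[challenge_id] = flag
        if correct:
            self.solves_by_agent.setdefault(agent_id, set()).add(challenge_id)
        return correct
```

Under these assumptions, the scoreboard sees at most one accepted submission per challenge, while every agent that finds the flag still gets credit in `solves_by_agent`.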
Problem

Research questions and friction points this paper is trying to address.

CTF benchmark
data contamination
LLM agent evaluation
cybersecurity
Innovation

Methods, ideas, or system contributions that make the work stand out.

CTFusion
LLM agent evaluation
Live CTF
Model Context Protocol
cybersecurity benchmark