CTFusion: A CTF-based Benchmark for LLM Agent Evaluation

📅 2026-05-12

🤖 AI Summary

Existing CTF benchmarks often reuse historical challenges, which exposes them to data contamination and cheating and undermines reliable evaluation of the cybersecurity capabilities of large language model (LLM) agents. This work proposes CTFusion, a streaming evaluation framework built on live CTF competitions. It limits data leakage and interference with the competition by keeping agents independent of one another under a single team account and by forwarding only the first correct flag for each challenge. CTFusion is implemented as a Model Context Protocol (MCP) server on the CTFd platform, so it can support diverse LLM agents and CTF events. Experiments with three LLMs, two agents, and five live CTF events indicate that existing benchmarks can be unreliable for assessing LLM agents, whereas CTFusion offers a more robust assessment. The code is released as open source.

📝 Abstract

Recent advances in Large Language Models (LLMs) have enabled agentic systems for complex, multi-step tasks, and cybersecurity is emerging as a prominent application. To evaluate such agents, researchers widely adopt Capture The Flag (CTF) benchmarks. However, current CTF benchmarks reuse existing challenges, which exposes them to data contamination and potential cheating. We confirmed these issues in practice by integrating web search tools into an existing agent. To address these limitations, we present CTFusion, a streaming evaluation framework built on Live CTFs. CTFusion preserves per-agent independence under a single team account and reduces its impact on the competition by forwarding only the first correct flag per challenge. We implement CTFusion as a Model Context Protocol (MCP) server on the widely used CTFd platform, which makes it applicable to diverse CTF events and agent types. Through experiments with three LLMs, two agents, and five Live CTFs, we demonstrate that existing CTF benchmarks can be unreliable in assessing LLM-based agents, while CTFusion can serve as a robust solution for evaluating cybersecurity agents. We release CTFusion as open source to foster future research in this area.
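
The first-correct-flag policy described in the abstract can be illustrated with a small sketch. This is not the CTFusion implementation: the class name `FirstSolveProxy`, the `submit_upstream` callback, and the rule that later flags are judged by exact match against the first accepted flag are all assumptions made for illustration. The sketch also assumes that attempts made before a challenge is solved still reach the platform, since that is the only way to learn whether a flag is correct.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class FirstSolveProxy:
    """Hypothetical proxy between several agents and one team account."""

    # Assumed callback: returns True if the CTF platform accepts the flag.
    submit_upstream: Callable[[str, str], bool]
    # First accepted flag per challenge (assumes static, exact-match flags).
    solved: Dict[str, str] = field(default_factory=dict)
    # Per-agent solve records, kept separately for each agent.
    agent_solves: Dict[str, Set[str]] = field(default_factory=dict)
    # Number of submissions that actually reached the platform.
    upstream_calls: int = 0

    def submit(self, agent_id: str, challenge_id: str, flag: str) -> bool:
        if challenge_id in self.solved:
            # Already solved upstream: judge locally and do not re-submit.
            correct = flag == self.solved[challenge_id]
        else:
            self.upstream_calls += 1
            correct = self.submit_upstream(challenge_id, flag)
            if correct:
                self.solved[challenge_id] = flag
        if correct:
            self.agent_solves.setdefault(agent_id, set()).add(challenge_id)
        return correct


# Usage with a stand-in for the platform (made-up challenge and flag).
answers = {"web-1": "flag{a}"}
proxy = FirstSolveProxy(lambda cid, flag: answers.get(cid) == flag)
proxy.submit("agent-A", "web-1", "flag{wrong}")  # forwarded, rejected
proxy.submit("agent-A", "web-1", "flag{a}")      # forwarded, accepted
proxy.submit("agent-B", "web-1", "flag{a}")      # judged locally
```

In this sketch both agents are credited with the solve, while the platform sees only one correct submission for the challenge.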

Problem

Research questions and friction points this paper is trying to address.

CTF benchmark

data contamination

LLM agent evaluation

cybersecurity

Innovation

Methods, ideas, or system contributions that make the work stand out.

CTFusion

LLM agent evaluation

Live CTF

Model Context Protocol