NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security

📅 2024-06-08
🏛️ Neural Information Processing Systems
📈 Citations: 5 · Influential citations: 0
🤖 AI Summary
Problem: Large language models (LLMs) lack rigorous, task-specific evaluation on real-world cybersecurity tasks, particularly capture-the-flag (CTF) challenges. Method: The paper introduces a scalable, open-source benchmark dataset of CTF challenges drawn from popular competitions, annotated with metadata to support LLM testing and adaptive learning, and pairs it with an automated evaluation framework. Built on LLM function calling, the framework runs CTF solving end to end, orchestrating external tool calls, generating responses, and verifying recovered flags, and it supports both black-box and open-source models. Contribution/Results: Five LLMs are systematically assessed on these authentic CTF challenges. The project releases the benchmark dataset, the evaluation framework, and an interactive playground, establishing a standardized, empirically grounded basis for research on AI-driven vulnerability discovery, exploitation, and response.
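
The function-call-driven workflow described in the summary can be pictured as a simple agent loop: the model either requests an external tool (for example, running a command against the challenge environment) or submits a candidate flag, which the harness then verifies. The sketch below is a minimal illustration under assumed interfaces; the tool names, message format, and the `model` callable are hypothetical and do not reflect the framework's actual API.

```python
# Minimal sketch of a function-call-driven CTF solving loop.
# NOTE: tool names, message format, and the `model` callable are hypothetical
# illustrations, not the benchmark framework's actual API.
import json
import subprocess


def run_command(cmd: str, timeout: int = 30) -> str:
    """Run a shell command and return combined stdout/stderr.

    A real harness would execute this inside the challenge's container
    rather than on the host machine.
    """
    proc = subprocess.run(cmd, shell=True, capture_output=True,
                          text=True, timeout=timeout)
    return proc.stdout + proc.stderr


def check_flag(candidate: str, expected: str) -> bool:
    """Verify a candidate flag against the challenge's ground-truth flag."""
    return candidate.strip() == expected.strip()


TOOLS = {"run_command": run_command}


def solve_challenge(model, description: str, expected_flag: str,
                    max_turns: int = 10) -> bool:
    """Drive the model turn by turn until it submits a flag or runs out of turns.

    `model` is any callable that maps the message history to either
    {"tool": name, "args": {...}} or {"flag": "..."}.
    """
    messages = [{"role": "user",
                 "content": f"Solve this CTF challenge:\n{description}"}]
    for _ in range(max_turns):
        reply = model(messages)
        if "flag" in reply:                      # result verification
            return check_flag(reply["flag"], expected_flag)
        tool = TOOLS.get(reply.get("tool"))      # external tool dispatch
        if tool is None:
            messages.append({"role": "user",
                             "content": "Unknown tool; try again."})
            continue
        output = tool(**reply.get("args", {}))
        messages.append({"role": "tool",
                         "content": json.dumps({"output": output})})
    return False
```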

📝 Abstract
Large Language Models (LLMs) are being deployed across various domains today. However, their capacity to solve Capture the Flag (CTF) challenges in cybersecurity has not been thoroughly evaluated. To address this, we develop a novel method to assess LLMs in solving CTF challenges by creating a scalable, open-source benchmark database specifically designed for these applications. This database includes metadata for LLM testing and adaptive learning, compiling a diverse range of CTF challenges from popular competitions. Utilizing the advanced function calling capabilities of LLMs, we build a fully automated system with an enhanced workflow and support for external tool calls. Our benchmark dataset and automated framework allow us to evaluate the performance of five LLMs, encompassing both black-box and open-source models. This work lays the foundation for future research into improving the efficiency of LLMs in interactive cybersecurity tasks and automated task planning. By providing a specialized benchmark, our project offers an ideal platform for developing, testing, and refining LLM-based approaches to vulnerability detection and resolution. Evaluating LLMs on these challenges and comparing them with human performance yields insights into the potential of AI-driven cybersecurity solutions to perform real-world threat management. We make our benchmark dataset open source to the public at https://github.com/NYU-LLM-CTF/NYU_CTF_Bench, along with our automated playground framework at https://github.com/NYU-LLM-CTF/llm_ctf_automation.
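
To make the "metadata for LLM testing" concrete, the following sketch shows how per-challenge metadata records might be loaded from a benchmark checkout and handed to an evaluation harness. The field names, the challenge.json file name, and the directory layout are illustrative assumptions; the repository linked above documents the actual schema.

```python
# Illustrative loader for per-challenge metadata records (field names and
# layout are assumptions; consult the NYU_CTF_Bench repository for the
# actual schema).
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Challenge:
    name: str
    category: str          # e.g. "pwn", "rev", "crypto", "web", "forensics"
    description: str
    files: list[str]       # challenge artifacts handed to the solver
    flag: str              # ground-truth flag for automated verification


def load_challenges(bench_root: str) -> list[Challenge]:
    """Walk the benchmark tree and parse one metadata file per challenge."""
    challenges = []
    for meta_path in Path(bench_root).rglob("challenge.json"):
        meta = json.loads(meta_path.read_text())
        challenges.append(Challenge(
            name=meta["name"],
            category=meta["category"],
            description=meta["description"],
            files=meta.get("files", []),
            flag=meta["flag"],
        ))
    return challenges
```

Keeping the ground-truth flag alongside the challenge description in one record is what allows the harness to verify results automatically, without human grading.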
Problem

Research questions and friction points this paper is trying to address.

Evaluate LLMs on cybersecurity CTF challenges
Develop a scalable, open-source benchmark dataset for this purpose
Build an automated framework for assessing LLM performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scalable, open-source benchmark dataset of CTF challenges
Automated framework with external tool-call support
Evaluation of five LLMs (black-box and open-source) on CTF tasks
👥 Authors
Minghao Shao, New York University
Sofija Jancheska, New York University
Meet Udeshi, New York University
Brendan Dolan-Gavitt, New York University
Haoran Xi, New York University
Kimberly Milner, New York University
Boyuan Chen, New York University Abu Dhabi
Max Yin, New York University
Siddharth Garg, New York University
P. Krishnamurthy, New York University
F. Khorrami, New York University
Ramesh Karri, New York University
Muhammad Shafique, New York University Abu Dhabi