Evaluating LLM Generated Detection Rules in Cybersecurity

📅 2025-09-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
A lack of systematic, standardized evaluation criteria for cybersecurity detection rules generated by large language models (LLMs) hinders their trustworthiness and real-world deployment. Method: We propose the first open-source, reproducible, multi-dimensional evaluation framework, grounded in a real-world security team's rule corpus and employing a holdout-set methodology. It quantitatively compares LLM-generated rules—exemplified by Sublime Security's Automated Detection Engineer (ADE)—against human-authored rules across three dimensions: practicality, coverage, and accuracy. Contribution/Results: We introduce a standardized metric suite inspired by expert consensus, enabling the first systematic benchmarking of LLMs' ability to generate security detection rules. Experimental results show that ADE-generated rules achieve performance comparable to human-authored rules on key metrics, supporting the practical viability of LLMs for this critical cybersecurity task.

📝 Abstract
LLMs are increasingly pervasive in the security environment, yet there are few measures of their effectiveness, which limits their trust and usefulness to security practitioners. Here, we present an open-source evaluation framework and benchmark metrics for evaluating LLM-generated cybersecurity rules. The benchmark employs a holdout set-based methodology to measure the effectiveness of LLM-generated security rules in comparison to a human-generated corpus of rules. It provides three key metrics inspired by the way experts evaluate security rules, offering a realistic, multifaceted evaluation of the effectiveness of an LLM-based security rule generator. This methodology is illustrated using rules from Sublime Security's detection team and those written by Sublime Security's Automated Detection Engineer (ADE), with a thorough analysis of ADE's skills presented in the results section.
Problem

Research questions and friction points this paper is trying to address.

Evaluating effectiveness of LLM-generated cybersecurity detection rules
Measuring LLM rule performance against human-generated security rules
Providing realistic metrics for LLM-based security rule generators
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source evaluation framework for LLM-generated rules
Holdout set methodology comparing LLM and human rules
Three expert-inspired metrics for multifaceted effectiveness evaluation
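The holdout-set comparison described above can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the paper's implementation: a "rule" is modeled as a predicate over a message, and the `coverage` and `accuracy` computations below are assumptions about what those dimensions could mean (share of malicious holdout messages flagged, and precision of the flags, respectively).

```python
def evaluate_rules(rules, holdout):
    """Score a rule set against a labeled holdout set.

    rules: list of callables, each mapping a message string to True/False.
    holdout: list of (message, is_malicious) pairs.
    Returns (coverage, accuracy): share of malicious messages flagged by
    any rule, and precision of the flags. Names and definitions are
    illustrative assumptions, not the paper's exact metrics.
    """
    flagged = [(msg, label) for msg, label in holdout
               if any(rule(msg) for rule in rules)]
    malicious = [msg for msg, label in holdout if label]
    caught = [msg for msg, label in flagged if label]
    coverage = len(caught) / len(malicious) if malicious else 0.0
    accuracy = len(caught) / len(flagged) if flagged else 0.0
    return coverage, accuracy

# Toy usage: a single hypothetical rule flagging one phishing phrase.
rules = [lambda msg: "urgent wire transfer" in msg.lower()]
holdout = [
    ("Urgent wire transfer needed today", True),
    ("Team lunch on Friday", False),
    ("Re: urgent wire transfer request", True),
    ("Invoice attached", True),  # malicious but missed by the rule
]
cov, acc = evaluate_rules(rules, holdout)
# cov == 2/3 (two of three malicious messages caught), acc == 1.0
```

The same scoring loop could be run twice, once on the LLM-generated rule set and once on the human-authored corpus, to produce the side-by-side comparison the benchmark describes.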
Anna Bertiger
Sublime Security, Washington, DC, USA
Bobby Filar
Sublime Security
Topics: natural language understanding, phishing detection, adversarial machine learning, security
Aryan Luthra
Sublime Security, Washington, DC, USA
Stefano Meschiari
Sublime Security, Washington, DC, USA
Aiden Mitchell
Sublime Security, Washington, DC, USA
Sam Scholten
Sublime Security, Washington, DC, USA
Vivek Sharath
Sublime Security, Washington, DC, USA