🤖 AI Summary
A lack of systematic, standardized criteria for evaluating the effectiveness of cybersecurity detection rules generated by large language models (LLMs) hinders their trustworthiness and real-world deployment. Method: We propose the first open-source, reproducible multi-dimensional evaluation framework, grounded in a real-world security team's rule corpus and employing a holdout-set methodology. It quantitatively compares LLM-generated rules—exemplified by Sublime Security's Automated Detection Engineer (ADE)—against human-authored rules across three dimensions: practicality, coverage, and accuracy. Contribution/Results: We introduce a standardized metric suite grounded in expert consensus, enabling the first systematic benchmarking of LLMs' capability to generate security detection rules. Experimental results demonstrate that ADE-generated rules achieve performance comparable to human-authored rules on key metrics, validating the practical viability of LLMs for this critical cybersecurity task.
📝 Abstract
LLMs are increasingly pervasive in security environments, yet measures of their effectiveness remain limited, which constrains their trustworthiness and usefulness to security practitioners. Here, we present an open-source evaluation framework and benchmark metrics for assessing LLM-generated cybersecurity rules. The benchmark employs a holdout-set methodology to measure the effectiveness of LLM-generated security rules against a human-authored corpus of rules. It provides three key metrics inspired by how experts evaluate security rules, offering a realistic, multifaceted assessment of an LLM-based security rule generator. We illustrate this methodology using rules from Sublime Security's detection team and rules written by Sublime Security's Automated Detection Engineer (ADE), with a thorough analysis of ADE's capabilities presented in the results section.