🤖 AI Summary
A lack of systematic, standardized criteria for evaluating the effectiveness of cybersecurity detection rules generated by large language models (LLMs) hinders their trustworthiness and real-world deployment. Method: We propose the first open-source, reproducible multi-dimensional evaluation framework, grounded in a real-world security team's rule corpus and employing a holdout-set methodology. It quantitatively compares LLM-generated rules—exemplified by Sublime Security's Automated Detection Engineer (ADE)—against human-authored rules across three dimensions: practicality, coverage, and accuracy. Contribution/Results: We introduce a standardized metric suite grounded in expert consensus, enabling the first systematic benchmarking of LLMs' capability to generate security detection rules. Experimental results demonstrate that ADE-generated rules achieve performance comparable to human-authored rules on key metrics, validating the practical viability of LLMs for this critical cybersecurity task.
📝 Abstract
LLMs are increasingly pervasive in security environments, yet measures of their effectiveness remain limited, which constrains their trustworthiness and usefulness to security practitioners. Here, we present an open-source evaluation framework and benchmark metrics for assessing LLM-generated cybersecurity rules. The benchmark employs a holdout-set methodology to measure the effectiveness of LLM-generated security rules against a human-authored corpus of rules. It provides three key metrics inspired by how experts evaluate security rules, offering a realistic, multifaceted assessment of an LLM-based security rule generator. We illustrate this methodology using rules from Sublime Security's detection team and rules written by Sublime Security's Automated Detection Engineer (ADE), with a thorough analysis of ADE's capabilities presented in the results section.