RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale

📅 2026-04-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge that the volume of newly disclosed CVEs far outpaces manual detection-rule development by proposing an automated approach for generating web vulnerability detection rules using large language models (LLMs). The method parses Nuclei templates and integrates structured and unstructured data to produce high-precision JSON-based detection rules. It employs an "LLM-as-a-judge" confidence validation framework combined with a 5×5 generation strategy (five parallel candidates, each with up to five refinement attempts), and incorporates a human-in-the-loop feedback mechanism to continuously refine rule quality. Experimental results show that the system reduces false positive rates by 67% in production environments and achieves an AUROC of 0.75 for rule validation, improving both the accuracy and efficiency of large-scale vulnerability detection.
📝 Abstract
Security teams face a challenge: the volume of newly disclosed Common Vulnerabilities and Exposures (CVEs) far exceeds the capacity to manually develop detection mechanisms. In 2025, the National Vulnerability Database published over 48,000 new vulnerabilities, motivating the need for automation. We present RuleForge, an AWS internal system that automatically generates detection rules--JSON-based patterns that identify malicious HTTP requests exploiting specific vulnerabilities--from structured Nuclei templates describing CVE details. Nuclei templates provide standardized, YAML-based vulnerability descriptions that serve as the structured input for our rule generation process. This paper focuses on RuleForge's architecture and operational deployment for CVE-related threat detection, with particular emphasis on our novel LLM-as-a-judge (Large Language Model as judge) confidence validation system and systematic feedback integration mechanism. This validation approach evaluates candidate rules across two dimensions--sensitivity (avoiding false negatives) and specificity (avoiding false positives)--achieving AUROC of 0.75 and reducing false positives by 67% compared to synthetic-test-only validation in production. Our 5x5 generation strategy (five parallel candidates with up to five refinement attempts each) combined with continuous feedback loops enables systematic quality improvement. We also present extensions enabling rule generation from unstructured data sources and demonstrate a proof-of-concept agentic workflow for multi-event-type detection. Our lessons learned highlight critical considerations for applying LLMs to cybersecurity tasks, including overconfidence mitigation and the importance of domain expertise in both prompt design and quality review of generated rules through human-in-the-loop validation.
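The 5×5 generation strategy described in the abstract (five parallel candidate rules, each refined up to five times against LLM-as-a-judge feedback on sensitivity and specificity) can be sketched as the following loop. All function names, thresholds, and the rule/score shapes below are illustrative assumptions, not the authors' actual API; the LLM calls are stubbed out with placeholders.

```python
# Hypothetical sketch of RuleForge's 5x5 generation strategy: five
# candidates, each judged and refined up to five times until the
# LLM-as-a-judge clears a confidence threshold on BOTH dimensions
# (sensitivity and specificity). Names and shapes are assumptions.
import json
from dataclasses import dataclass
from typing import Optional, Tuple

N_CANDIDATES = 5           # parallel candidate rules
MAX_REFINEMENTS = 5        # refinement attempts per candidate
CONFIDENCE_THRESHOLD = 0.8 # illustrative cutoff, not from the paper

@dataclass
class JudgeScore:
    sensitivity: float  # judged likelihood of catching real exploits
    specificity: float  # judged likelihood of avoiding false positives

    def passes(self, threshold: float) -> bool:
        # A rule must clear the threshold on both dimensions.
        return min(self.sensitivity, self.specificity) >= threshold

def generate_rule(template: dict, feedback: Optional[str] = None) -> dict:
    """Placeholder for an LLM call that turns a Nuclei template
    (plus optional judge feedback) into a JSON detection rule."""
    return {"cve": template.get("id"), "pattern": "...", "feedback": feedback}

def judge_rule(rule: dict) -> Tuple[JudgeScore, str]:
    """Placeholder for the LLM-as-a-judge call; returns scores on
    both dimensions plus textual feedback for the next refinement."""
    return JudgeScore(sensitivity=0.9, specificity=0.9), "looks ok"

def five_by_five(template: dict) -> Optional[dict]:
    """Return the first candidate whose judged confidence clears the
    threshold on both dimensions, refining with judge feedback."""
    for _ in range(N_CANDIDATES):
        rule = generate_rule(template)
        for _ in range(MAX_REFINEMENTS):
            score, feedback = judge_rule(rule)
            if score.passes(CONFIDENCE_THRESHOLD):
                return rule
            rule = generate_rule(template, feedback)
    return None  # no candidate passed; escalate to human review

rule = five_by_five({"id": "CVE-2025-0001"})
print(json.dumps(rule) if rule else "needs human review")
```

With the stubbed judge always returning 0.9 on both dimensions, the first candidate is accepted immediately; in the real system the judge scores would come from a second LLM pass, and candidates failing all refinement attempts would route to the paper's human-in-the-loop review.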
Problem

Research questions and friction points this paper is trying to address.

CVE
vulnerability detection
automated rule generation
web security
false positives
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-as-a-judge
automated rule generation
vulnerability detection
feedback integration
Nuclei templates
Ayush Garg
Unknown affiliation
Machine Learning · Natural Language Processing · Computer Vision
Sophia Hager
Johns Hopkins University
Jacob Montiel
Amazon
machine learning · data science · cybersecurity · data streams
Aditya Tiwari
Amazon Web Services
Michael Gentile
Amazon Web Services
Zach Reavis
Amazon Web Services
David Magnotti
Amazon Web Services
Wayne Fullen
Amazon Web Services