🤖 AI Summary
This work proposes RAVEN, an automated framework for generating memory corruption vulnerability analysis reports by integrating a multi-agent large language model (LLM) architecture with retrieval-augmented generation (RAG). The framework comprises four collaborative modules (Explorer, RAG Engine, Analyst, and Reporter) and is presented as the first application of a multi-agent LLM combined with RAG specifically for vulnerability documentation. It further introduces a task-specific LLM-based Judge that automatically evaluates the generated reports along multiple dimensions. On 105 samples covering 15 CWE types from the NIST-SARD dataset, the framework achieves an average report quality score of 54.21%, which the authors take as support for the approach to automated vulnerability analysis and structured reporting.
📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across various cybersecurity tasks, including vulnerability classification, detection, and patching. However, their potential in automated vulnerability report documentation and analysis remains underexplored. We present RAVEN (Retrieval Augmented Vulnerability Exploration Network), a framework leveraging LLM agents and Retrieval Augmented Generation (RAG) to synthesize comprehensive vulnerability analysis reports. Given vulnerable source code, RAVEN generates reports following the Google Project Zero Root Cause Analysis template. The framework uses four modules: an Explorer agent for vulnerability identification, a RAG engine that retrieves relevant knowledge from curated databases including Google Project Zero reports and CWE entries, an Analyst agent for impact and exploitation assessment, and a Reporter agent for structured report generation. To ensure quality, RAVEN includes a task-specific LLM Judge that evaluates reports across structural integrity, ground truth alignment, code reasoning quality, and remediation quality. We evaluate RAVEN on 105 vulnerable code samples covering 15 CWE types from the NIST-SARD dataset. Results show an average quality score of 54.21%, supporting the effectiveness of our approach for automated vulnerability documentation.
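
The four-module flow described in the abstract can be illustrated with a minimal sketch. This is not RAVEN's implementation: the `LLM` callable, the `Report` fields, the word-overlap retriever, the prompt strings, and the unweighted averaging of Judge scores are all assumptions made for illustration, since the abstract does not specify prompts, retrieval method, or how the four Judge dimensions are combined.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for an LLM call: prompt in, text out.
LLM = Callable[[str], str]


@dataclass
class Report:
    finding: str        # Explorer output
    context: list[str]  # documents returned by the RAG engine
    analysis: str       # Analyst output
    text: str           # Reporter output


def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Toy retriever (assumption): rank documents by word overlap with the query."""
    words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def run_pipeline(source: str, llm: LLM, corpus: list[str]) -> Report:
    """Chain Explorer -> RAG engine -> Analyst -> Reporter."""
    finding = llm(f"EXPLORER: identify the vulnerability in:\n{source}")
    context = retrieve(finding, corpus)
    analysis = llm(
        "ANALYST: assess impact and exploitation.\n"
        f"Finding: {finding}\nContext: {context}"
    )
    text = llm(
        "REPORTER: write a root cause analysis report.\n"
        f"Finding: {finding}\nAnalysis: {analysis}"
    )
    return Report(finding, context, analysis, text)


def average_score(scores: dict[str, float]) -> float:
    """Unweighted mean over Judge dimensions (the weighting is an assumption)."""
    return sum(scores.values()) / len(scores)
```

In this sketch the model is injected as a plain callable, so the same pipeline can be exercised with a stub during testing and with a real model client in use. A Judge along these lines would return one score per dimension (structural integrity, ground truth alignment, code reasoning quality, remediation quality), which `average_score` would then reduce to a single number.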