Using LLMs for Security Advisory Investigations: How Far Are We?

📅 2025-06-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs), exemplified by ChatGPT, are increasingly deployed in cybersecurity tasks such as vulnerability announcement generation and parsing, yet their reliability in this safety-critical domain remains poorly understood. Method: We conduct a systematic evaluation of LLM trustworthiness across three tasks: generating realistic security advisories from CVE-IDs, discriminating real from fabricated CVE-IDs, and reverse-extracting CVE-IDs from advisory descriptions. We introduce a novel "bidirectional consistency" paradigm (generation → verification), leveraging zero-shot prompting and a manually annotated, balanced dataset of 200 CVE-IDs (100 real, 100 fake), augmented with multi-round output consistency analysis. Contribution/Results: While LLMs generate plausible advisories for 96% of genuine CVE-IDs, they also produce plausible advisories for 97% of fabricated CVE-IDs instead of flagging them, and erroneously map 6% of real advisories back to fake CVE-IDs. This reveals a fundamental "high verisimilitude, low verifiability" flaw in LLMs for security communication, providing the first empirical evidence and an actionable evaluation framework for trustworthy AI in cybersecurity applications.
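The bidirectional-consistency check described above can be sketched as a simple round-trip test: generate an advisory from a CVE-ID, then ask whether the ID recovered from that advisory matches the original. A minimal sketch, assuming stand-in functions for the model calls (`toy_generate` and `toy_extract` are illustrative stubs, not the paper's prompts; the paper used zero-shot prompting of ChatGPT):

```python
import re

def bidirectional_check(cve_id, generate_advisory, extract_cve_id):
    """Forward pass: generate an advisory from a CVE-ID.
    Backward pass: recover the CVE-ID from that advisory.
    Consistent iff the round-trip returns the original ID."""
    advisory = generate_advisory(cve_id)
    recovered = extract_cve_id(advisory)
    return recovered == cve_id

# Toy stand-ins for the two LLM calls:
def toy_generate(cve_id):
    return f"Advisory: a buffer overflow tracked as {cve_id} allows remote code execution."

def toy_extract(advisory):
    m = re.search(r"CVE-\d{4}-\d{4,}", advisory)
    return m.group(0) if m else None

print(bidirectional_check("CVE-2021-44228", toy_generate, toy_extract))  # True
```

In the study, 6% of real advisories failed this round trip, i.e., the backward pass returned a CVE-ID that does not exist.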


📝 Abstract
Large Language Models (LLMs) are increasingly used in software security, but their trustworthiness in generating accurate vulnerability advisories remains uncertain. This study investigates the ability of ChatGPT to (1) generate plausible security advisories from CVE-IDs, (2) differentiate real from fake CVE-IDs, and (3) extract CVE-IDs from advisory descriptions. Using a curated dataset of 100 real and 100 fake CVE-IDs, we manually analyzed the credibility and consistency of the model's outputs. The results show that ChatGPT generated plausible security advisories for 96% of the real CVE-IDs and 97% of the fake CVE-IDs given as input, demonstrating a limitation in differentiating between real and fake IDs. Furthermore, when these generated advisories were fed back to ChatGPT to identify their original CVE-ID, the model produced a fake CVE-ID in 6% of cases involving real advisories. These findings highlight both the strengths and limitations of ChatGPT in cybersecurity applications. While the model demonstrates potential for automating advisory generation, its inability to reliably authenticate CVE-IDs or maintain consistency upon re-evaluation underscores the risks of deploying it in critical security tasks. Our study emphasizes the importance of using LLMs with caution in cybersecurity workflows and suggests the need for design improvements to strengthen their reliability and applicability in security advisory generation.
Problem

Research questions and friction points this paper is trying to address.

Assess ChatGPT's ability to generate accurate security advisories from CVE-IDs
Evaluate ChatGPT's capability to distinguish real from fake CVE-IDs
Examine ChatGPT's consistency in extracting CVE-IDs from advisory descriptions
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs generate security advisories from CVE-IDs
LLMs struggle to differentiate real/fake CVE-IDs
LLMs extract CVE-IDs from advisory descriptions
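One reason differentiating real from fake CVE-IDs is hard, which the second point above highlights, is that syntactic validity can be checked offline while authenticity cannot: a well-formed fabricated ID passes any format check and can only be confirmed against the MITRE/NVD registry. A minimal sketch of the format-only check (the helper name is ours, not from the paper):

```python
import re

# Syntactic check only: "CVE-" + 4-digit year + sequence number of
# four or more digits (MITRE's post-2014 CVE-ID format).
CVE_FORMAT = re.compile(r"CVE-\d{4}-\d{4,}")

def is_well_formed_cve(cve_id: str) -> bool:
    """True if the string matches CVE-ID syntax; says nothing about
    whether the ID actually exists in the CVE registry."""
    return CVE_FORMAT.fullmatch(cve_id) is not None

print(is_well_formed_cve("CVE-2014-0160"))   # True (real: Heartbleed)
print(is_well_formed_cve("CVE-2099-99999"))  # True (well-formed but fabricated)
print(is_well_formed_cve("CVE-14-016"))      # False (malformed)
```

Because fabricated IDs are typically well-formed, a model (or any purely local check) cannot distinguish them from real ones without a registry lookup, which is consistent with ChatGPT accepting 97% of fake CVE-IDs.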