🤖 AI Summary
This study addresses the subjectivity and poor timeliness of manual CVSS scoring by conducting the first systematic evaluation of large language models (LLMs) for automated CVSS scoring across all three metric categories: Base, Temporal, and Environmental. We propose a multi-strategy prompt engineering framework to optimize LLM generation of CVSS vectors and benchmark it against an embedding-based supervised classification model. Results show that LLMs achieve high accuracy on objective metrics (e.g., Attack Vector, Privileges Required) but significantly underperform on the subjective Confidentiality, Integrity, and Availability impact dimensions, where the embedding model excels. A hybrid approach combining both methods improves consistency and reliability across all CVSS dimensions. This work establishes a novel paradigm for automated vulnerability severity assessment, advancing CVSS evaluation toward greater efficiency, reproducibility, and scalability.
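The hybrid approach can be pictured as per-metric routing: trust the LLM on the objective Base metrics and the embedding classifier on the subjective C/I/A impacts, then assemble a single CVSS v3.1 vector string. The sketch below is a minimal illustration of that idea, not the paper's implementation; `llm_predict` and `embedding_predict` are hypothetical stubs standing in for the two real models.

```python
# Hypothetical hybrid CVSS scorer: route objective metrics to an LLM
# and the subjective C/I/A impact metrics to an embedding classifier.

OBJECTIVE = ["AV", "AC", "PR", "UI", "S"]   # objective Base metrics
SUBJECTIVE = ["C", "I", "A"]                # subjective impact metrics

def llm_predict(description: str) -> dict:
    """Stub for an LLM prompted to emit CVSS metric values (canned output)."""
    return {"AV": "N", "AC": "L", "PR": "N", "UI": "N", "S": "U",
            "C": "L", "I": "L", "A": "L"}

def embedding_predict(description: str) -> dict:
    """Stub for the supervised embedding classifier, C/I/A only (canned output)."""
    return {"C": "H", "I": "H", "A": "N"}

def hybrid_cvss_vector(description: str) -> str:
    llm = llm_predict(description)
    emb = embedding_predict(description)
    merged = {m: llm[m] for m in OBJECTIVE}         # LLM wins on objective metrics
    merged.update({m: emb[m] for m in SUBJECTIVE})  # embeddings win on C/I/A
    parts = ["CVSS:3.1"] + [f"{m}:{merged[m]}" for m in OBJECTIVE + SUBJECTIVE]
    return "/".join(parts)

print(hybrid_cvss_vector("SQL injection in login form"))
```

With these canned stubs the merged vector is `CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:N`; in a real system the routing table itself could be learned from per-metric validation accuracy rather than fixed by hand.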
📝 Abstract
Common Vulnerabilities and Exposures (CVE) records are fundamental to cybersecurity, offering unique identifiers for publicly known software and system vulnerabilities. Each CVE is typically assigned a Common Vulnerability Scoring System (CVSS) score to support risk prioritization and remediation. However, score inconsistencies often arise due to subjective interpretations of certain metrics. As the number of new CVEs continues to grow rapidly, automation is increasingly necessary to ensure timely and consistent scoring. While prior studies have explored automated methods, the application of Large Language Models (LLMs), despite their recent popularity, remains relatively underexplored. In this work, we evaluate the effectiveness of LLMs in generating CVSS scores for newly reported vulnerabilities. We investigate various prompt engineering strategies to enhance their accuracy and compare LLM-generated scores against those from embedding-based models, which use vector representations classified via supervised learning. Our results show that while LLMs demonstrate potential in automating CVSS evaluation, embedding-based methods outperform them in scoring the more subjective components, particularly the confidentiality, integrity, and availability impacts. These findings underscore the complexity of CVSS scoring and suggest that combining LLMs with embedding-based methods could yield more reliable results across all scoring components.