🤖 AI Summary
This study addresses long-standing limitations of open-source cyber threat intelligence (CTI), notably inconsistent reporting standards and a lack of structured data, which have fragmented the understanding of relationships between threat actors and victims. The authors present the first large language model-based automated pipeline to systematically structure 13,308 CTI reports spanning two decades with high precision, extracting key entities including threat actors, motives, victims, vendors, and technical indicators. Through large-scale quantitative analysis of CTI information density, vendor overlap, and geographic and sectoral biases, the work reveals, for the first time, an “echo chamber” effect and structural biases within the CTI ecosystem. Core vendors provide foundational coverage, but the marginal gain from each additional source diminishes significantly, and the intelligence is markedly skewed toward particular geographies and industry sectors.
📝 Abstract
Despite the high volume of open-source Cyber Threat Intelligence (CTI), our understanding of long-term threat actor-victim dynamics remains fragmented due to the lack of structured datasets and inconsistent reporting standards. In this paper, we present a large-scale automated analysis of open-source CTI reports spanning two decades. We develop a high-precision, LLM-based pipeline to ingest and structure 13,308 reports, extracting key entities such as attributed threat actors, motivations, victims, reporting vendors, and technical indicators (IoCs and TTPs). Our analysis quantifies the evolution of CTI information density and specialization, characterizing patterns that relate specific threat actors to motivations and victim profiles. Furthermore, we perform a meta-analysis of the CTI industry itself. We identify a fragmented ecosystem of distinct silos where vendors demonstrate significant geographic and sectoral reporting biases. Our marginal coverage analysis reveals that intelligence overlap between vendors is typically low: while a few core providers may offer broad situational awareness, additional sources yield diminishing returns. Overall, our findings characterize the structural biases inherent in the CTI ecosystem, enabling practitioners and researchers to better evaluate the completeness of their intelligence sources.
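The marginal coverage idea in the abstract can be illustrated with a minimal sketch. The abstract does not describe the authors' actual procedure, so the greedy ordering below is an assumption (a standard set-cover heuristic), and the function name `marginal_coverage`, the vendor names, and the actor labels are all hypothetical toy values. At each step it picks the vendor that adds the most not-yet-covered threat actors, then records the gain and the cumulative share covered.

```python
from typing import Dict, List, Set, Tuple


def marginal_coverage(vendor_items: Dict[str, Set[str]]) -> List[Tuple[str, int, float]]:
    """Greedily order vendors by the number of new items each one contributes.

    Returns a list of (vendor, marginal_gain, cumulative_coverage_fraction).
    This is an illustrative heuristic, not the method used in the paper.
    """
    universe: Set[str] = set().union(*vendor_items.values())
    covered: Set[str] = set()
    remaining = dict(vendor_items)
    order: List[Tuple[str, int, float]] = []

    while remaining:
        # Pick the vendor with the largest marginal gain; iterating over
        # sorted names makes ties resolve alphabetically.
        best = max(sorted(remaining), key=lambda v: len(remaining[v] - covered))
        gain = len(remaining[best] - covered)
        covered |= remaining.pop(best)
        fraction = len(covered) / len(universe) if universe else 0.0
        order.append((best, gain, fraction))

    return order


# Hypothetical toy data: which threat actors each vendor has reported on.
toy_reports = {
    "VendorA": {"actor_1", "actor_2", "actor_3", "actor_4"},
    "VendorB": {"actor_2", "actor_3", "actor_5"},
    "VendorC": {"actor_1", "actor_4"},
}

result = marginal_coverage(toy_reports)
for vendor, gain, fraction in result:
    print(f"{vendor}: +{gain} new actors, {fraction:.0%} cumulative coverage")
```

In this toy input, the first vendor covers four of the five actors, the second adds one, and the third adds none. This mirrors the diminishing-returns pattern the abstract describes, though the real data may behave differently.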