CTI-HAL: A Human-Annotated Dataset for Cyber Threat Intelligence Analysis

📅 2025-04-08
🤖 AI Summary
Problem: Existing CTI datasets lack fine-grained, framework-aligned annotations, which hinders automated threat intelligence understanding. Method: We introduce CTI-HAL, the first high-quality, open-source dataset built from real-world threat reports, manually co-annotated and strictly aligned with MITRE ATT&CK’s tactic and technique layers. It enables report-level mapping to atomic ATT&CK capabilities and employs Krippendorff’s alpha to quantify inter-annotator agreement (α > 0.85). We evaluate LLMs’ zero-shot and few-shot generalization in realistic operational scenarios. Contribution/Results: LLMs without fine-tuning accurately reconstruct APT tactical chains and achieve a 23.6% F1-score improvement in cross-report structured extraction over SOTA baselines. CTI-HAL establishes a reproducible, verifiable benchmark and methodological paradigm for automated CTI analysis.

📝 Abstract
Organizations are increasingly targeted by Advanced Persistent Threats (APTs), which involve complex, multi-stage tactics and diverse techniques. Cyber Threat Intelligence (CTI) sources, such as incident reports and security blogs, provide valuable insights, but are often unstructured and written in natural language, making it difficult to extract information automatically. Recent studies have explored the use of AI to perform automatic extraction from CTI data, leveraging existing CTI datasets for performance evaluation and fine-tuning. However, these datasets present challenges and limitations that reduce their effectiveness. To overcome these issues, we introduce a novel dataset manually constructed from CTI reports and structured according to the MITRE ATT&CK framework. To assess its quality, we conducted an inter-annotator agreement study using Krippendorff's alpha, confirming its reliability. Furthermore, the dataset was used to evaluate a Large Language Model (LLM) in a real-world business context, showing promising generalizability.
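The abstract names Krippendorff's alpha as the agreement measure but does not describe how it was computed. As a rough illustration only, the sketch below computes alpha for nominal labels, assuming each annotated unit (for example, a sentence from a report) receives one ATT&CK technique ID per annotator. The function name and the sample labels are hypothetical and are not taken from the dataset.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: list of lists; each inner list holds the labels that the
    annotators gave to one unit (None marks a missing annotation).
    """
    # Build the coincidence matrix: every ordered pair of labels from
    # two different annotators on the same unit adds 1 / (m - 1).
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # a unit with fewer than two labels cannot be paired
        for i, j in permutations(range(m), 2):
            coincidences[(values[i], values[j])] += 1.0 / (m - 1)

    # Marginal totals per label and overall number of pairable values.
    totals = Counter()
    for (label, _), weight in coincidences.items():
        totals[label] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("not enough pairable annotations")

    # Observed and expected disagreement (up to a shared factor of 1 / n).
    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected


# Hypothetical labels from two annotators over four units.
sample_units = [
    ["T1566", "T1566"],
    ["T1566", "T1059"],
    ["T1059", "T1059"],
    ["T1059", "T1059"],
]
alpha = krippendorff_alpha_nominal(sample_units)
```

A value of 1 indicates perfect agreement and a value near 0 indicates agreement no better than chance. Real annotation studies often need other distance functions or multi-label handling, which this sketch does not cover.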
Problem

Research questions and friction points this paper is trying to address.

Unstructured CTI data hinders automatic information extraction
Existing CTI datasets have limitations affecting AI effectiveness
Need for reliable annotated dataset aligned with ATT&CK framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human-annotated dataset based on MITRE ATT&CK
Inter-annotator agreement study ensures reliability
Evaluated LLM in real-world business context
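The text does not specify how the LLM output was scored against the annotations. One plausible setup, shown here purely as an assumption, compares the set of ATT&CK technique IDs predicted for each report with the human-annotated set and reports micro-averaged precision, recall, and F1. The function name, report identifiers, and technique sets below are hypothetical.

```python
def micro_precision_recall_f1(gold, predicted):
    """Micro-averaged scores over per-report sets of technique IDs.

    gold, predicted: dicts mapping a report id to a set of ATT&CK
    technique IDs (annotated and model-extracted, respectively).
    """
    tp = fp = fn = 0
    for report_id in gold.keys() | predicted.keys():
        g = gold.get(report_id, set())
        p = predicted.get(report_id, set())
        tp += len(g & p)  # techniques found by both
        fp += len(p - g)  # predicted but not annotated
        fn += len(g - p)  # annotated but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Hypothetical annotations and model output for two reports.
gold_labels = {
    "report-1": {"T1566", "T1059"},
    "report-2": {"T1003"},
}
model_output = {
    "report-1": {"T1566", "T1105"},
    "report-2": {"T1003"},
}
precision, recall, f1 = micro_precision_recall_f1(gold_labels, model_output)
```

Micro-averaging weights every technique mention equally. A per-report (macro) average would instead weight every report equally, which may matter when report lengths vary widely.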
Authors
Sofia Della Penna
DIETI, Università degli Studi di Napoli Federico II, Naples, Italy
Roberto Natella
Associate Professor, Università degli Studi di Napoli Federico II
Software Dependability, Software Security
Vittorio Orbinato
DIETI, Università degli Studi di Napoli Federico II, Naples, Italy
Lorenzo Parracino
DIETI, Università degli Studi di Napoli Federico II, Naples, Italy
Luciano Pianese
DIETI, Università degli Studi di Napoli Federico II, Naples, Italy