🤖 AI Summary
This study addresses two problems that are common when extracting MITRE ATT&CK Tactics, Techniques, and Procedures (TTPs) from cyber threat intelligence reports: high false-negative rates and annotations that the text does not support. The authors propose TTPrint, a diverge-then-converge framework that emulates the workflow of human analysts. First, a large language model (LLM) broadly generates candidate TTPs. Each candidate is then anchored to a specific evidence snippet through deterministic localization, and only candidates supported by both that snippet and the official MITRE definition are retained. The work introduces two evaluation datasets, TRAM-Clean and TTPrint-Bench, and evaluates the method across six LLM backbones to show that it generalizes. TTPrint reaches macro-F1 scores of 76.48% and 87.39% on the respective datasets, improvements of 63.5% and 29.4% over the best-performing baseline.
📝 Abstract
Extracting MITRE ATT&CK techniques from cyber threat intelligence (CTI) reports is an open-set, multi-label problem requiring both high recall (not missing techniques) and high precision (not hallucinating unsupported ones). Existing methods (rule-based, supervised, and LLM-based) struggle to achieve both: rule-based and supervised approaches lack generalizability across diverse attack descriptions, while LLM-based approaches that couple candidate generation and validation within a single inference step suffer from both limited recall and limited precision. We propose TTPrint, which addresses this challenge through a diverge-then-converge design inspired by how human analysts work: first extracting broadly, then verifying rigorously. In the divergent phase, reports are decomposed into atomic behaviors and candidate techniques are proposed broadly. A deterministic span localization stage then anchors each candidate to a specific evidence window in the source text. A convergent verification stage retains only candidates supported by both the localized evidence and the authoritative MITRE definition. We contribute two evaluation resources, a cleaned TRAM benchmark (TRAM-Clean) and a new annotated dataset (TTPrint-Bench), to address known annotation noise in existing benchmarks and elevate the task to document-level TTP extraction. On TRAM-Clean and TTPrint-Bench, TTPrint achieves 76.48% and 87.39% macro-F1, respectively, outperforming the leading baseline by 63.5% and 29.4%. A multi-backbone analysis across six LLMs and a threshold sensitivity study further demonstrate generalizability across model choices and provide practical guidance for parameter selection.
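The three stages described above (broad proposal, deterministic span localization, convergent verification) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the LLM proposal step is replaced by a fixed stub, verification uses simple token overlap rather than a model, the definitions are abridged paraphrases rather than official MITRE text, and the names `propose_candidates`, `localize`, `verify`, and both thresholds are hypothetical.

```python
import re
from dataclasses import dataclass

# Abridged paraphrases standing in for official MITRE ATT&CK definitions (assumption).
DEFINITIONS = {
    "T1566": "Adversaries send phishing email messages with a malicious attachment or link.",
    "T1059": "Adversaries abuse command and script interpreters such as PowerShell to execute commands.",
    "T1003": "Adversaries dump credentials from operating system memory.",
}

STOPWORDS = {"a", "an", "the", "to", "of", "and", "or", "with", "from", "as", "such", "that"}


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def coverage(query: str, window: str) -> float:
    """Fraction of the query's tokens that also occur in the window."""
    q = tokens(query)
    return len(q & tokens(window)) / len(q) if q else 0.0


@dataclass
class Candidate:
    technique_id: str
    behavior: str  # atomic behavior that motivated the proposal


def propose_candidates(report: str) -> list[Candidate]:
    """Divergent phase. Hypothetical stub for an LLM call; it over-generates on purpose."""
    return [
        Candidate("T1566", "phishing email with malicious attachment"),
        Candidate("T1059", "PowerShell script executed commands"),
        Candidate("T1003", "dump credentials from memory"),  # not supported by the report
    ]


def localize(report: str, cand: Candidate) -> tuple[str, float]:
    """Deterministic localization: pick the sentence that best covers the behavior."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", report) if s.strip()]
    best = max(sentences, key=lambda s: coverage(cand.behavior, s))
    return best, coverage(cand.behavior, best)


def verify(cand: Candidate, evidence: str, evidence_score: float,
           evidence_threshold: float = 0.6, definition_threshold: float = 0.3) -> bool:
    """Convergent phase: require support from both the evidence and the definition."""
    definition_score = coverage(DEFINITIONS[cand.technique_id], evidence)
    return evidence_score >= evidence_threshold and definition_score >= definition_threshold


def extract(report: str) -> list[str]:
    """Run propose -> localize -> verify and return the retained technique IDs."""
    kept = []
    for cand in propose_candidates(report):
        evidence, score = localize(report, cand)
        if verify(cand, evidence, score):
            kept.append(cand.technique_id)
    return kept


REPORT = (
    "The actor sent a phishing email with a malicious attachment. "
    "The attachment launched a PowerShell script that executed commands."
)

if __name__ == "__main__":
    print(extract(REPORT))
```

In this toy report the over-generated credential-dumping candidate has no matching sentence, so it is dropped, while the two candidates with explicit textual evidence survive. The two thresholds play the role of the tunable parameters that a sensitivity study would examine; the values here are arbitrary.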