🤖 AI Summary
This paper addresses the identification of MITRE ATT&CK techniques in Cyber Threat Intelligence (CTI) reports published on the web, where three problems impede accurate extraction: class imbalance, overfitting, and domain-specific complexity. After evaluating four configurations built on the Threat Report ATT&CK Mapper (TRAM) and open-source large language models (LLMs) such as Llama2, we propose a two-step pipeline. First, an LLM summarises each report. Second, a SciBERT model is retrained on a dataset rebalanced with LLM-generated data. The pipeline improves F1-scores over the baseline models, with several attack techniques exceeding an F1-score of 0.90. The work aims to make web-based CTI systems more efficient, to support collaborative cybersecurity operations, and to motivate future research on human-AI collaboration platforms.
📝 Abstract
This work evaluates the performance of Cyber Threat Intelligence (CTI) extraction methods in identifying attack techniques from threat reports available on the web using the MITRE ATT&CK framework. We analyse four configurations utilising state-of-the-art tools, including the Threat Report ATT&CK Mapper (TRAM) and open-source Large Language Models (LLMs) such as Llama2. Our findings reveal significant challenges, including class imbalance, overfitting, and domain-specific complexity, which impede accurate technique extraction. To mitigate these issues, we propose a novel two-step pipeline: first, an LLM summarises the reports, and second, a retrained SciBERT model processes a rebalanced dataset augmented with LLM-generated data. This approach achieves an improvement in F1-scores compared to baseline models, with several attack techniques surpassing an F1-score of 0.90. Our contributions enhance the efficiency of web-based CTI systems and support collaborative cybersecurity operations in an interconnected digital landscape, paving the way for future research on integrating human-AI collaboration platforms.
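The rebalancing step mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `rebalance` and `placeholder_generator`, the per-class target, and the `(text, label)` data layout are all assumptions made for illustration, and the generator is a stand-in for whatever LLM call would produce synthetic report sentences.

```python
from typing import Callable, Dict, List, Tuple

Sample = Tuple[str, str]  # (sentence text, ATT&CK technique ID)


def rebalance(
    samples: List[Sample],
    target_per_class: int,
    generate: Callable[[str, List[str], int], List[str]],
) -> List[Sample]:
    """Top up every under-represented label to target_per_class samples.

    `generate(label, seed_texts, n)` is a hypothetical hook that should
    return up to n synthetic texts for the given label.
    """
    # Group the original texts by their technique label.
    by_label: Dict[str, List[str]] = {}
    for text, label in samples:
        by_label.setdefault(label, []).append(text)

    # Keep all original samples, then append synthetic ones as needed.
    augmented = list(samples)
    for label, texts in by_label.items():
        deficit = target_per_class - len(texts)
        if deficit > 0:
            new_texts = generate(label, texts, deficit)
            augmented.extend((t, label) for t in new_texts[:deficit])
    return augmented


def placeholder_generator(label: str, seed_texts: List[str], n: int) -> List[str]:
    """Stand-in for an LLM call (assumption): recycles seed texts with a marker."""
    return [f"[synthetic {i}] {seed_texts[i % len(seed_texts)]}" for i in range(n)]


if __name__ == "__main__":
    data = [
        ("The script spawns a PowerShell process.", "T1059"),
        ("A batch file runs encoded commands.", "T1059"),
        ("The loader invokes cmd.exe.", "T1059"),
        ("Victims received a malicious attachment.", "T1566"),
    ]
    for text, label in rebalance(data, 3, placeholder_generator):
        print(label, "|", text)
```

In a real pipeline the placeholder would be replaced by a prompt to an LLM, and the augmented list would feed the classifier's retraining. Majority classes are left untouched in this sketch; whether the paper also downsamples them is not stated in the abstract.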