Towards Effective Identification of Attack Techniques in Cyber Threat Intelligence Reports using Large Language Models

📅 2025-05-06
🤖 AI Summary
This paper evaluates how well Cyber Threat Intelligence (CTI) extraction methods identify MITRE ATT&CK techniques in threat reports, comparing configurations built on the Threat Report ATT&CK Mapper (TRAM) and open-source large language models such as Llama2. It highlights three obstacles to accurate extraction: class imbalance, model overfitting, and domain-specific complexity. We propose a two-step pipeline. First, an LLM summarises each report. Second, a SciBERT model is retrained on a rebalanced dataset augmented with LLM-generated samples. The pipeline improves F1-scores over baseline models, with several attack techniques surpassing an F1-score of 0.90. The work aims to make web-based CTI systems more efficient and to support collaborative cybersecurity operations, including future human-AI collaboration platforms.

📝 Abstract
This work evaluates the performance of Cyber Threat Intelligence (CTI) extraction methods in identifying attack techniques from threat reports available on the web using the MITRE ATT&CK framework. We analyse four configurations utilising state-of-the-art tools, including the Threat Report ATT&CK Mapper (TRAM) and open-source Large Language Models (LLMs) such as Llama2. Our findings reveal significant challenges, including class imbalance, overfitting, and domain-specific complexity, which impede accurate technique extraction. To mitigate these issues, we propose a novel two-step pipeline: first, an LLM summarises the reports, and second, a retrained SciBERT model processes a rebalanced dataset augmented with LLM-generated data. This approach achieves an improvement in F1-scores compared to baseline models, with several attack techniques surpassing an F1-score of 0.90. Our contributions enhance the efficiency of web-based CTI systems and support collaborative cybersecurity operations in an interconnected digital landscape, paving the way for future research on integrating human-AI collaboration platforms.
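To illustrate why class imbalance matters when results are reported as F1-scores per attack technique, here is a minimal sketch in plain Python. It assumes a single-label setting (one technique per text segment), which is a simplification and not necessarily the paper's evaluation protocol. The function name and the toy labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for single-label predictions (illustrative helper)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    scores = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Toy data (made up): one frequent technique, one rare technique.
y_true = ["T1059", "T1059", "T1059", "T1566"]
# A classifier that always predicts the majority class.
y_pred = ["T1059", "T1059", "T1059", "T1059"]

scores = per_class_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = sum(scores.values()) / len(scores)
print(accuracy, scores, macro_f1)
```

In this toy case the accuracy is 0.75 even though the rare technique is never detected and its F1 is 0, which is why per-technique F1 is a more informative measure on imbalanced CTI data.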
Problem

Research questions and friction points this paper is trying to address.

Evaluating CTI extraction methods for attack technique identification
Addressing class imbalance and domain complexity in technique extraction
Proposing a two-step LLM and SciBERT pipeline for improved accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses LLMs to summarize cyber threat reports
Retrains SciBERT on rebalanced LLM-augmented data
Improves F1-scores for attack technique identification
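The rebalancing idea in the points above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it tops up every under-represented technique to the size of the largest class, and the `generate` callable is a hypothetical stand-in for whatever LLM call produces a synthetic sample. The paper's actual prompts, target class sizes, and filtering steps are not given in this summary.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

Sample = Tuple[str, str]  # (text, technique label)


def augmentation_plan(labels: List[str], target: Optional[int] = None) -> dict:
    """How many synthetic samples each label needs to reach `target`."""
    counts = Counter(labels)
    if target is None:
        target = max(counts.values())  # assumption: match the largest class
    return {label: max(0, target - n) for label, n in counts.items()}


def rebalance(samples: List[Sample],
              generate: Callable[[str, int], str],
              target: Optional[int] = None) -> List[Sample]:
    """Return the original samples plus generated ones for rare labels."""
    plan = augmentation_plan([label for _, label in samples], target)
    augmented = list(samples)
    for label, needed in plan.items():
        augmented.extend((generate(label, i), label) for i in range(needed))
    return augmented


def fake_llm(label: str, i: int) -> str:
    """Placeholder generator; a real pipeline would call an LLM here."""
    return f"synthetic sentence {i} describing {label}"


# Toy data (made up): three samples of one technique, one of another.
data = [("ran a script", "T1059")] * 3 + [("sent a lure email", "T1566")]
balanced = rebalance(data, fake_llm)
print(Counter(label for _, label in balanced))
```

Passing the generator as a parameter keeps the balancing logic testable without a model. In a real setting, the rebalanced set would then be used to retrain the classifier.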