🤖 AI Summary
Traditional cyber threat intelligence (CTI) credibility assessment predominantly relies on static classification paradigms, either handcrafted feature engineering or isolated deep learning models, rendering it ill-suited to CTI's inherent incompleteness, heterogeneity, and noise, while also lacking interpretability. To address these limitations, this paper proposes a large language model (LLM)-based multi-step reasoning framework that integrates adaptive information extraction, iterative evidence retrieval, and prompt-driven natural language inference to enable dynamic and transparent credibility evaluation. The framework models the entire CTI verification process end to end, substantially improving decision robustness and auditability. Evaluated on the CTI-200 and PolitiFact benchmarks, it achieves 90.9% macro-F1 and 93.6% micro-F1, outperforming state-of-the-art approaches by more than 5%.
📝 Abstract
Verifying the credibility of Cyber Threat Intelligence (CTI) is essential for reliable cybersecurity defense. However, traditional approaches typically treat this task as a static classification problem, relying on handcrafted features or isolated deep learning models. These methods often lack the robustness needed to handle incomplete, heterogeneous, or noisy intelligence, and they provide limited transparency in decision-making, factors that reduce their effectiveness in real-world threat environments. To address these limitations, we propose LRCTI, a Large Language Model (LLM)-based framework designed for multi-step CTI credibility verification. The framework first employs a text summarization module to distill complex intelligence reports into concise and actionable threat claims. It then uses an adaptive multi-step evidence retrieval mechanism that iteratively identifies and refines supporting information from a CTI-specific corpus, guided by LLM feedback. Finally, a prompt-based Natural Language Inference (NLI) module is applied to evaluate the credibility of each claim while generating interpretable justifications for the classification outcome. Experiments conducted on two benchmark datasets, CTI-200 and PolitiFact, show that LRCTI improves F1-Macro and F1-Micro scores by over 5%, reaching 90.9% and 93.6%, respectively, compared to state-of-the-art baselines. These results demonstrate that LRCTI effectively addresses the core limitations of prior methods, offering a scalable, accurate, and explainable solution for automated CTI credibility verification.
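The three-stage pipeline the abstract describes (claim summarization, adaptive multi-step evidence retrieval, prompt-based NLI) can be sketched as follows. This is a minimal illustrative skeleton, not the paper's implementation: the LLM components are replaced with simple stand-ins (first-sentence summarization, keyword-overlap retrieval with query expansion standing in for LLM feedback, and an overlap-threshold verdict standing in for NLI), and all function names are hypothetical.

```python
def summarize(report: str) -> str:
    """Stand-in for the LLM summarization module: distill a report
    into a single threat claim (here, just its first sentence)."""
    return report.split(".")[0].strip()


def retrieve(claim: str, corpus: list[str], steps: int = 2, k: int = 2) -> list[str]:
    """Stand-in for adaptive multi-step retrieval: rank corpus passages
    by keyword overlap with the claim, then expand the query with terms
    from retrieved passages (mimicking LLM-guided refinement)."""
    query = set(claim.lower().split())
    evidence: list[str] = []
    for _ in range(steps):
        scored = sorted(
            (p for p in corpus if p not in evidence),
            key=lambda p: len(query & set(p.lower().split())),
            reverse=True,
        )
        new = scored[:k]
        evidence.extend(new)
        for passage in new:  # refine the query with retrieved terms
            query |= set(passage.lower().split())
    return evidence


def infer(claim: str, evidence: list[str]) -> str:
    """Stand-in for prompt-based NLI: call the claim 'credible' if at
    least one evidence passage shares two or more keywords with it."""
    claim_words = set(claim.lower().split())
    support = sum(
        1 for p in evidence if len(claim_words & set(p.lower().split())) >= 2
    )
    return "credible" if support >= 1 else "not credible"


def verify(report: str, corpus: list[str]) -> str:
    """End-to-end pipeline: summarize, retrieve evidence, infer a label."""
    claim = summarize(report)
    evidence = retrieve(claim, corpus)
    return infer(claim, evidence)
```

In the actual framework each stand-in would be an LLM call, and the NLI stage would also emit a natural-language justification alongside the label; the skeleton only shows how the three stages compose.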