🤖 AI Summary
This paper identifies and systematically analyzes three fundamental vulnerabilities of large language models (LLMs) in cyber threat intelligence (CTI) applications: spurious correlations, contradictory knowledge, and constrained generalization. These vulnerabilities are rooted in the inherent dynamism, fragmentation, and semantic ambiguity of threat intelligence itself, not in model architecture. To diagnose them reliably, we propose an evaluation framework that combines large-scale benchmarking on real-world threat reports with stratification of failure instances, autoregressive refinement, and human-in-the-loop expert validation. Experiments across multiple CTI benchmarks confirm the prevalence and severity of these vulnerabilities and yield actionable insights for building more robust LLM-powered CTI systems. Our key contributions are: (i) the first attribution of LLM failures in CTI to intrinsic properties of the threat landscape rather than to model architecture; (ii) a reproducible vulnerability-diagnosis methodology; and (iii) principled design guidelines for enhancing LLM robustness, thereby establishing both theoretical foundations and practical pathways for trustworthy, LLM-driven threat analysis.
📝 Abstract
Large Language Models (LLMs) are increasingly used to assist security analysts in counteracting the rapid exploitation of cyber threats, providing cyber threat intelligence (CTI) that supports vulnerability assessment and incident response. While recent work has shown that LLMs can support a wide range of CTI tasks, such as threat analysis, vulnerability detection, and intrusion defense, significant performance gaps persist in practical deployments. In this paper, we investigate the intrinsic vulnerabilities of LLMs in CTI, focusing on challenges that arise from the nature of the threat landscape itself rather than from model architecture. Using large-scale evaluations across multiple CTI benchmarks and real-world threat reports, we introduce a novel categorization methodology that integrates stratification, autoregressive refinement, and human-in-the-loop supervision to reliably analyze failure instances. Through extensive experiments and human inspection, we reveal three fundamental vulnerabilities that limit LLMs in effectively supporting CTI: spurious correlations, contradictory knowledge, and constrained generalization. Finally, we provide actionable insights for designing more robust LLM-powered CTI systems to facilitate future research.
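To make the categorization methodology concrete, here is a minimal sketch of the stratify, refine, and review loop described in the abstract. It assumes a generic LLM client with a `classify` method; every name in it (`FailureInstance`, `refine_label`, the category strings, the confidence threshold) is an illustrative assumption, not the paper's actual implementation.

```python
# Hypothetical sketch of the stratification -> autoregressive refinement ->
# human-in-the-loop loop described in the abstract. Names and the `llm`
# client API are assumptions for illustration only.
from collections import defaultdict
from dataclasses import dataclass, field

CATEGORIES = ["spurious_correlation", "contradictory_knowledge",
              "constrained_generalization"]

@dataclass
class FailureInstance:
    task: str               # e.g. "threat_analysis", "vuln_detection"
    prompt: str
    model_output: str
    labels: list = field(default_factory=list)  # refinement history

def stratify(failures):
    """Group failure instances by CTI task so that each stratum can be
    analyzed against task-appropriate criteria."""
    strata = defaultdict(list)
    for f in failures:
        strata[f.task].append(f)
    return strata

def refine_label(llm, instance, rounds=3):
    """Autoregressive refinement: each round re-classifies the failure,
    conditioning on the label proposed in the previous round."""
    label, confidence = None, 0.0
    for _ in range(rounds):
        prompt = (
            f"Task: {instance.task}\nOutput: {instance.model_output}\n"
            f"Previous label: {label}\n"
            f"Assign one category from {CATEGORIES} and a confidence in [0, 1]."
        )
        label, confidence = llm.classify(prompt)  # assumed client API
        instance.labels.append((label, confidence))
    return label, confidence

def categorize(llm, failures, confidence_threshold=0.8):
    """Full loop: stratify, refine each instance, and route low-confidence
    cases to expert review (the human-in-the-loop step)."""
    needs_human_review = []
    for task, stratum in stratify(failures).items():
        for inst in stratum:
            label, conf = refine_label(llm, inst)
            if conf < confidence_threshold:
                needs_human_review.append(inst)
    return needs_human_review
```

The design choice worth noting is the confidence gate: automated refinement handles the bulk of failure instances, while only the uncertain residue is escalated to human experts, which is what makes large-scale failure analysis tractable.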