CTIArena: Benchmarking LLM Knowledge and Reasoning Across Heterogeneous Cyber Threat Intelligence

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing CTI evaluation benchmarks suffer from three key limitations: (1) closed-book settings that disregard external knowledge bases, (2) narrow task coverage, and (3) single-source analysis that fails to reflect real-world, multi-source cyber threat intelligence (CTI) requirements. To address these gaps, we introduce CTIArena, the first knowledge-enhanced benchmark for heterogeneous, multi-source CTI, encompassing nine tasks across structured, unstructured, and hybrid data categories. CTIArena integrates a knowledge-enhanced paradigm that combines retrieval-augmented generation (RAG) with the models' parametric knowledge to enable joint reasoning over cross-source, heterogeneous intelligence. Extensive experiments on ten mainstream LLMs demonstrate that knowledge enhancement significantly improves performance, reveal inherent deficiencies of general-purpose models on CTI tasks, and underscore the need for domain-specific adaptation. CTIArena establishes a systematic, rigorous standard for evaluating LLM capabilities on CTI.

📝 Abstract
Cyber threat intelligence (CTI) is central to modern cybersecurity, providing critical insights for detecting and mitigating evolving threats. With the natural language understanding and reasoning capabilities of large language models (LLMs), there is increasing interest in applying them to CTI, which calls for benchmarks that can rigorously evaluate their performance. Several early efforts have studied LLMs on some CTI tasks but remain limited: (i) they adopt only closed-book settings, relying on parametric knowledge without leveraging CTI knowledge bases; (ii) they cover only a narrow set of tasks, lacking a systematic view of the CTI landscape; and (iii) they restrict evaluation to single-source analysis, unlike realistic scenarios that require reasoning across multiple sources. To fill these gaps, we present CTIArena, the first benchmark for evaluating LLM performance on heterogeneous, multi-source CTI under knowledge-augmented settings. CTIArena spans three categories (structured, unstructured, and hybrid), further divided into nine tasks that capture the breadth of CTI analysis in modern security operations. We evaluate ten widely used LLMs and find that most struggle in closed-book setups but show noticeable gains when augmented with security-specific knowledge through our designed retrieval-augmented techniques. These findings highlight the limitations of general-purpose LLMs and the need for domain-tailored techniques to fully unlock their potential for CTI.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM knowledge and reasoning across heterogeneous cyber threat intelligence
Assessing LLM performance under knowledge-augmented multi-source CTI scenarios
Benchmarking LLMs on structured, unstructured, and hybrid CTI analysis tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for multi-source cyber threat intelligence evaluation
Knowledge-augmented retrieval techniques for security domains
Structured, unstructured, and hybrid CTI analysis framework
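The knowledge-augmented setup described above can be pictured as a minimal retrieval-augmented loop over a CTI knowledge base: retrieve the most relevant domain entries for a query, then prepend them to the prompt before the LLM call. The sketch below is illustrative only, not the paper's implementation; the knowledge-base entries, the bag-of-words "embedding", and the prompt template are all stand-in assumptions (a real system would use a learned embedding model and a curated corpus such as ATT&CK).

```python
import math
from collections import Counter

def embed(text):
    # Bag-of-words term-frequency vector: a toy stand-in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical CTI knowledge-base entries (ATT&CK-style technique notes).
KNOWLEDGE_BASE = [
    "T1566 phishing adversaries send spearphishing emails with malicious attachments",
    "T1059 command and scripting interpreter adversaries abuse powershell and bash",
    "T1486 data encrypted for impact ransomware encrypts files to extort victims",
]

def retrieve(query, kb, k=1):
    # Rank knowledge-base entries by similarity to the query; return the top k.
    qv = embed(query)
    scored = sorted(kb, key=lambda doc: cosine(qv, embed(doc)), reverse=True)
    return scored[:k]

def build_prompt(query, kb):
    # Augment the closed-book query with retrieved domain knowledge before
    # handing it to the LLM (the LLM call itself is omitted here).
    context = "\n".join(retrieve(query, kb))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

prompt = build_prompt("Which technique covers spearphishing emails", KNOWLEDGE_BASE)
```

The closed-book baseline in the benchmark corresponds to sending the bare query; the knowledge-augmented setting corresponds to sending `prompt`, so the gain from retrieval can be measured by evaluating both variants on the same tasks.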