Benchmarking LLM-Assisted Blue Teaming via Standardized Threat Hunting

📅 2025-09-27

🤖 AI Summary

Problem: The practical utility of large language models (LLMs) for blue-team threat hunting remains unclear as cyber threats grow more sophisticated. Method: The paper proposes CyberTeam, a structured, task-oriented benchmark framework for blue-team operations, which this summary describes as the first of its kind. CyberTeam models the dependencies among analytical tasks in a standardized workflow that runs from threat attribution to incident response and covers 30 representative tasks. Each task is then addressed through a set of modular operational units, nine in total, so that threat hunting becomes a reproducible sequence of reasoning steps. Contribution/Results: The evaluation covers leading LLMs and state-of-the-art cybersecurity agents and compares CyberTeam against open-ended reasoning strategies. The reported results favor the standardized design, which makes LLM reasoning in hunting scenarios easier to control and to assess, and they expose the limitations of open-ended reasoning in end-to-end threat hunting. The benchmark is intended as a basis for evaluating and advancing LLM-driven cybersecurity reasoning.

📝 Abstract

As cyber threats continue to grow in scale and sophistication, blue team defenders increasingly require advanced tools to proactively detect and mitigate risks. Large Language Models (LLMs) offer promising capabilities for enhancing threat analysis. However, their effectiveness in real-world blue team threat-hunting scenarios remains insufficiently explored. This paper presents CyberTeam, a benchmark designed to guide LLMs in blue teaming practice. CyberTeam constructs a standardized workflow in two stages. First, it models realistic threat-hunting workflows by capturing the dependencies among analytical tasks from threat attribution to incident response. Next, each task is addressed through a set of operational modules tailored to its specific analytical requirements. This transforms threat hunting into a structured sequence of reasoning steps, with each step grounded in a discrete operation and ordered according to task-specific dependencies. Guided by this framework, LLMs are directed to perform threat-hunting tasks through modularized steps. Overall, CyberTeam integrates 30 tasks and 9 operational modules to guide LLMs through standardized threat analysis. We evaluate both leading LLMs and state-of-the-art cybersecurity agents, comparing CyberTeam against open-ended reasoning strategies. Our results highlight the improvements enabled by standardized design, while also revealing the limitations of open-ended reasoning in real-world threat hunting.
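The dependency-ordered, module-based workflow described in the abstract can be illustrated with a small sketch. This is not the CyberTeam implementation: the task names, stage labels, and module names below are hypothetical placeholders chosen for illustration, since the abstract does not list the 30 tasks or 9 modules. The sketch only shows the general idea of ordering tasks by their dependencies and expanding each task into its discrete operations.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Task:
    name: str
    stage: str                           # e.g. "threat attribution" or "incident response"
    modules: tuple[str, ...]             # ordered operations used to solve the task
    depends_on: tuple[str, ...] = ()     # tasks whose output this task needs


# Hypothetical tasks and modules; the real benchmark defines its own.
TASKS = [
    Task("ioc_extraction", "threat attribution", ("extract",)),
    Task("malware_identification", "threat attribution",
         ("extract", "classify"), depends_on=("ioc_extraction",)),
    Task("actor_attribution", "threat attribution",
         ("correlate", "summarize"), depends_on=("malware_identification",)),
    Task("response_planning", "incident response",
         ("map", "summarize"), depends_on=("actor_attribution",)),
]


def plan(tasks):
    """Return (task, module) steps ordered so every dependency runs first."""
    by_name = {t.name: t for t in tasks}
    graph = {t.name: set(t.depends_on) for t in tasks}
    steps = []
    # static_order() yields each task only after all of its predecessors.
    for name in TopologicalSorter(graph).static_order():
        for module in by_name[name].modules:
            steps.append((name, module))
    return steps


if __name__ == "__main__":
    for task_name, module in plan(TASKS):
        print(f"{task_name}: {module}")
```

In a setup like this, each `(task, module)` pair would correspond to one prompt or tool call issued to the model, which is what makes the intermediate steps individually checkable. How CyberTeam actually prompts and scores each step is not specified in the abstract.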

Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM effectiveness in real-world blue team threat hunting

Standardizing threat analysis workflow through modular operational steps

Benchmarking structured reasoning against open-ended cybersecurity strategies

Innovation

Methods, ideas, or system contributions that make the work stand out.

Standardized workflow models threat hunting dependencies

Modular operations transform tasks into structured reasoning steps

Benchmark integrates 30 tasks and 9 operational modules
