Benchmarking LLMs in an Embodied Environment for Blue Team Threat Hunting

📅 2025-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Large language models (LLMs) lack rigorous evaluation and exhibit uncontrolled execution in real-world blue-team threat hunting. Method: This paper introduces CYBERTEAM, the first embodied threat-hunting benchmark tailored to blue-team operational practice. It proposes a task-dependency modeling framework and encapsulates nine domain-specific embodied functions, decomposing end-to-end analysis into a three-level, composable, and verifiable execution paradigm: "task–function–workflow." The benchmark integrates 30 realistic threat-hunting tasks to systematically evaluate mainstream LLMs and security agents. Contribution/Results: Experiments demonstrate that embodied function invocation significantly outperforms conventional prompting techniques. The evaluation also uncovers critical bottlenecks in current models, including reasoning coherence, context fidelity, and operational robustness, which highlight key gaps between LLM capabilities and practical blue-team requirements.

📝 Abstract

As cyber threats continue to grow in scale and sophistication, blue team defenders increasingly require advanced tools to proactively detect and mitigate risks. Large Language Models (LLMs) offer promising capabilities for enhancing threat analysis. However, their effectiveness in real-world blue team threat-hunting scenarios remains insufficiently explored. In this paper, we present CYBERTEAM, a benchmark designed to guide LLMs in blue teaming practice. CYBERTEAM constructs an embodied environment in two stages. First, it models realistic threat-hunting workflows by capturing the dependencies among analytical tasks from threat attribution to incident response. Next, each task is addressed through a set of embodied functions tailored to its specific analytical requirements. This transforms the overall threat-hunting process into a structured sequence of function-driven operations, where each node represents a discrete function and edges define the execution order. Guided by this framework, LLMs are directed to perform threat-hunting tasks through modular steps. Overall, CYBERTEAM integrates 30 tasks and 9 embodied functions, guiding LLMs through pipelined threat analysis. We evaluate leading LLMs and state-of-the-art cybersecurity agents, comparing CYBERTEAM's embodied function-calling against fundamental elicitation strategies. Our results offer valuable insights into the current capabilities and limitations of LLMs in threat hunting, laying the foundation for their practical adoption in real-world cybersecurity applications.
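
The abstract describes a workflow in which each node is a discrete function and edges define the execution order. A minimal Python sketch of that idea follows. It is not CYBERTEAM's implementation: the three function names, the toy intel table, and the `run_workflow` helper are all hypothetical, chosen only to illustrate running dependent steps in topological order.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List, Set, Tuple

Context = Dict[str, Any]
Step = Callable[[Context], Context]


def extract_indicators(ctx: Context) -> Context:
    # Hypothetical function: keep whitespace-separated tokens that look like IPv4 addresses.
    tokens = str(ctx["log"]).split()
    ips = [t for t in tokens if t.count(".") == 3]
    return {**ctx, "indicators": ips}


def attribute_threat(ctx: Context) -> Context:
    # Hypothetical function: look indicators up in a toy, made-up intel table.
    intel = {"203.0.113.7": "example-actor"}
    actors = sorted({intel[i] for i in ctx["indicators"] if i in intel})
    return {**ctx, "actors": actors}


def plan_response(ctx: Context) -> Context:
    # Hypothetical function: propose one blocking action per indicator.
    return {**ctx, "actions": [f"block {ip}" for ip in ctx["indicators"]]}


# Nodes are functions (illustrative names, not the paper's nine functions).
FUNCTIONS: Dict[str, Step] = {
    "extract_indicators": extract_indicators,
    "attribute_threat": attribute_threat,
    "plan_response": plan_response,
}

# Edges define execution order: each node maps to the set of nodes it depends on.
DEPENDENCIES: Dict[str, Set[str]] = {
    "extract_indicators": set(),
    "attribute_threat": {"extract_indicators"},
    "plan_response": {"extract_indicators", "attribute_threat"},
}


def run_workflow(ctx: Context) -> Tuple[List[str], Context]:
    # Resolve a valid execution order, then thread the context through each step.
    order = list(TopologicalSorter(DEPENDENCIES).static_order())
    for name in order:
        ctx = FUNCTIONS[name](ctx)
    return order, ctx


order, result = run_workflow({"log": "denied connection from 203.0.113.7 port 22"})
print(order)
print(result["actors"], result["actions"])
```

In a benchmark setting, each step would presumably be carried out by the LLM through a function call rather than by hard-coded logic; the fixed graph is what makes every intermediate output checkable on its own.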

Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs in realistic cyber threat-hunting scenarios

Assessing LLM effectiveness in modular threat analysis workflows

Comparing embodied function-calling with basic elicitation strategies

Innovation

Methods, ideas, or system contributions that make the work stand out.

Embodied environment models threat-hunting workflows

Function-driven operations structure threat analysis

Modular steps guide LLMs in cybersecurity tasks

🔎 Similar Papers

"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak