Design Principles for the Construction of a Benchmark Evaluating Security Operation Capabilities of Multi-agent AI Systems

πŸ“… 2026-03-30
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the absence of a systematic benchmark for evaluating the collaborative defense capabilities of multi-agent AI systems in blue-team security operations. Existing approaches are largely confined to single-task scenarios, which are insufficient to support the development of autonomous Security Operations Centers (SOCs). To bridge this gap, the paper proposes design principles for a multi-task collaborative blue-team benchmark and introduces SOC-bench, a conceptual framework grounded in the cybersecurity incident response lifecycle. SOC-bench encompasses five core tasks: detection, analysis, containment, recovery, and post-incident review. It enables, for the first time, a comprehensive and systematic assessment of multi-agent AI systems’ defensive performance in large-scale ransomware attack scenarios, thereby establishing a theoretical and structural foundation for quantifiable and scalable evaluation of SOC automation.
πŸ“ Abstract
As Large Language Models (LLMs) and multi-agent AI systems demonstrate increasing potential in cybersecurity operations, organizations, policymakers, model providers, and researchers in the AI and cybersecurity communities are interested in quantifying the capabilities of such AI systems to achieve more autonomous SOCs (security operation centers) and reduce manual effort. In particular, the AI and cybersecurity communities have recently developed several benchmarks for evaluating the red team capabilities of multi-agent AI systems. However, because SOC workloads are dominated by blue team operations, the capabilities of AI systems and agents to achieve more autonomous SOCs cannot be evaluated without a benchmark focused on blue team operations. To the best of our knowledge, no systematic benchmark for evaluating coordinated multi-task blue team AI has been proposed in the literature; existing blue team benchmarks each focus on a single task. The goal of this work is to develop a set of design principles for the construction of a benchmark, denoted SOC-bench, to evaluate the blue team capabilities of AI. Following these design principles, we have developed a conceptual design of SOC-bench, which consists of a family of five blue team tasks in the context of large-scale ransomware attack incident response.
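The five-task lifecycle structure described in the abstract could be represented as a simple evaluation harness. A minimal sketch follows; the task names (detection, analysis, containment, recovery, post-incident review) come from the paper, but every identifier and the unweighted-mean scoring rule are hypothetical illustrations, not SOC-bench's actual design.

```python
from dataclasses import dataclass, field

# Task names taken from the paper's incident response lifecycle;
# all class names and scoring logic below are illustrative assumptions.
INCIDENT_RESPONSE_TASKS = [
    "detection",
    "analysis",
    "containment",
    "recovery",
    "post-incident review",
]

@dataclass
class TaskResult:
    task: str
    score: float  # fraction of task-specific checks passed, in [0, 1]

@dataclass
class BenchmarkRun:
    scenario: str  # e.g. a large-scale ransomware incident
    results: list = field(default_factory=list)

    def record(self, task: str, score: float) -> None:
        # Only lifecycle tasks defined by the benchmark are scoreable.
        assert task in INCIDENT_RESPONSE_TASKS, f"unknown task: {task}"
        self.results.append(TaskResult(task, score))

    def overall(self) -> float:
        # Unweighted mean across recorded tasks (an illustrative choice;
        # a real benchmark might weight tasks by operational importance).
        return sum(r.score for r in self.results) / len(self.results)

run = BenchmarkRun(scenario="large-scale ransomware attack")
for task in INCIDENT_RESPONSE_TASKS:
    run.record(task, 1.0)
print(run.overall())  # → 1.0
```

The point of the sketch is that per-task scores remain inspectable while a single aggregate enables cross-system comparison, which is the kind of quantifiable, scalable evaluation the summary argues SOC automation needs.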
Problem

Research questions and friction points this paper is trying to address.

multi-agent AI systems
blue team operations
security operation centers
benchmark
cybersecurity
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent AI systems
blue team operations
security operation center (SOC)
benchmark design
ransomware incident response
Yicheng Cai
Cyber Security Laboratory, Pennsylvania State University, University Park
Mitchell John DeStefano
Cyber Security Laboratory, Pennsylvania State University, University Park
Guodong Dong
Cyber Security Laboratory, Pennsylvania State University, University Park
Pulkit Handa
Cyber Security Laboratory, Pennsylvania State University, University Park
Peng Liu
Raymond G. Tronzo, M.D. Professor of Cybersecurity, Penn State University
Systems Security - AI Security
Tejas Singhal
Cyber Security Laboratory, Pennsylvania State University, University Park
Peiyu Tseng
Cyber Security Laboratory, Pennsylvania State University, University Park
Winston Jen White
Cyber Security Laboratory, Pennsylvania State University, University Park