Design Principles for the Construction of a Benchmark Evaluating Security Operation Capabilities of Multi-agent AI Systems

📅 2026-03-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the absence of a systematic benchmark for evaluating the collaborative defense capabilities of multi-agent AI systems in blue-team security operations. Existing blue-team benchmarks are largely confined to single-task scenarios, which is insufficient to support the development of autonomous Security Operations Centers (SOCs). To bridge this gap, the paper proposes design principles for a multi-task collaborative blue-team benchmark and introduces SOC-bench, a conceptual framework grounded in the cybersecurity incident response lifecycle. SOC-bench encompasses five core tasks: detection, analysis, containment, recovery, and post-incident review. As a conceptual design, it is intended to support a comprehensive and systematic assessment of multi-agent AI systems’ defensive performance in large-scale ransomware attack scenarios, laying a structural foundation for quantifiable and scalable evaluation of SOC automation.

📝 Abstract

As Large Language Models (LLMs) and multi-agent AI systems demonstrate increasing potential in cybersecurity operations, organizations, policymakers, model providers, and researchers in the AI and cybersecurity communities are interested in quantifying the capabilities of such AI systems to achieve more autonomous security operation centers (SOCs) and reduce manual effort. In particular, the AI and cybersecurity communities have recently developed several benchmarks for evaluating the red team capabilities of multi-agent AI systems. However, because the operations in SOCs are dominated by blue team operations, the capabilities of AI systems and agents to achieve more autonomous SOCs cannot be evaluated without a benchmark focused on blue team operations. To the best of our knowledge, no systematic benchmark for evaluating coordinated multi-task blue team AI has been proposed in the literature; existing blue team benchmarks each focus on a single task. The goal of this work is to develop a set of design principles for the construction of a benchmark, denoted SOC-bench, to evaluate the blue team capabilities of AI. Following these design principles, we have developed a conceptual design of SOC-bench, which consists of a family of five blue team tasks in the context of large-scale ransomware attack incident response.

Problem

Research questions and friction points this paper is trying to address.

multi-agent AI systems

blue team operations

security operation centers

benchmark

cybersecurity

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent AI systems

blue team operations

security operation center (SOC)