DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments

📅 2025-05-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the lack of systematic, reproducible evaluation of large language models (LLMs) on cybersecurity tasks. To bridge this gap, we propose DefenderBench, an open-source benchmark covering offensive, defensive, and domain-knowledge dimensions of cybersecurity. Built on a standardized agent framework, DefenderBench integrates multiple controllable simulation environments, including network intrusion, malicious content detection, code vulnerability analysis, and cybersecurity knowledge assessment, enabling fair, like-for-like comparison across proprietary and open-weight LLMs. Its modular architecture and unified scoring mechanism ensure reproducibility and extensibility. Experimental results show that Claude-3.7-Sonnet achieves the highest overall score (81.65), while Llama-3.3-70B performs best among open-weight models (71.81). The benchmark implementation, evaluation suite, and all test cases are publicly released to foster community-driven customization and extension.

📝 Abstract
Large language model (LLM) agents have shown impressive capabilities in human language comprehension and reasoning, yet their potential in cybersecurity remains underexplored. We introduce DefenderBench, a practical, open-source toolkit for evaluating language agents across offense, defense, and cybersecurity knowledge-based tasks. DefenderBench includes environments for network intrusion, malicious content detection, code vulnerability analysis, and cybersecurity knowledge assessment. It is intentionally designed to be affordable and easily accessible for researchers while providing fair and rigorous assessment. We benchmark several state-of-the-art (SoTA) and popular LLMs, including both open- and closed-weight models, using a standardized agentic framework. Our results show that Claude-3.7-sonnet performs best with a DefenderBench score of 81.65, followed by Claude-3.7-sonnet-think with 78.40, while the best open-weight model, Llama 3.3 70B, is not far behind with a DefenderBench score of 71.81. DefenderBench's modular design allows seamless integration of custom LLMs and tasks, promoting reproducibility and fair comparisons. An anonymized version of DefenderBench is available at https://github.com/microsoft/DefenderBench.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM agents in cybersecurity tasks
Assessing offense, defense, and knowledge-based capabilities
Providing an affordable, rigorous toolkit for fair comparisons
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source toolkit for cybersecurity agent evaluation
Modular design for custom LLM and task integration
Standardized framework for rigorous performance benchmarking
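The modular design described above, where custom LLMs and tasks plug into a standardized framework with a unified score, can be sketched as a small task registry. This is a hypothetical illustration, not DefenderBench's actual API: the `Benchmark`, `register`, and `evaluate` names, and the agent-as-callable interface, are assumptions for the example; the real toolkit's interfaces may differ.

```python
from statistics import mean
from typing import Callable, Dict

# An agent is modeled as any callable that maps a prompt to a response.
Agent = Callable[[str], str]
# A task takes an agent and returns a score in [0, 100].
Task = Callable[[Agent], float]

class Benchmark:
    """Hypothetical sketch of a modular benchmark: tasks are registered
    by name, and any agent is scored against all of them uniformly."""

    def __init__(self) -> None:
        self._tasks: Dict[str, Task] = {}

    def register(self, name: str, task: Task) -> None:
        # New environments (intrusion, vulnerability analysis, ...) plug in here.
        self._tasks[name] = task

    def evaluate(self, agent: Agent) -> Dict[str, float]:
        # Run every registered task, then aggregate into one overall score
        # (here an unweighted mean, as a stand-in for a unified scoring rule).
        results = {name: task(agent) for name, task in self._tasks.items()}
        results["overall"] = mean(results.values())
        return results

# Toy usage: one illustrative detection task and a trivial agent.
bench = Benchmark()
bench.register(
    "malicious_content",
    lambda agent: 100.0 if agent("Is this attachment malicious?") == "yes" else 0.0,
)
scores = bench.evaluate(lambda prompt: "yes")
```

Because tasks and agents only meet through these two callable interfaces, swapping in a different LLM backend or adding a new environment requires no changes to the scoring loop, which is the kind of extensibility the toolkit's design aims for.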