CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates attack-selection bias in large language models (LLMs) deployed as autonomous agents in offensive cybersecurity: each agent concentrates on a narrow subset of attack families and neglects other viable strategies, regardless of prompt variations. The authors introduce CyBiasBench, a 630-session benchmark that evaluates five LLM agents across three targets, four prompt conditions, and ten attack families. Each agent shows a different dominant attack family and a different entropy level in its attack-family allocation distribution. The authors characterize this bias as a trait of the agent rather than a factor associated with attack success rate. They also identify a "bias momentum" effect, in which agents resist explicit steering toward attack families that conflict with their bias, and the forced distribution shift yields no measurable improvement in attack performance. An interactive result dashboard and a reproducibility artifact with aggregated session-level statistics and full evaluation scripts are publicly released.
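To make "dominant attack family" and "distributional entropy" concrete, here is a minimal Python sketch of how both could be computed from a list of attack-family labels. The function names, the family labels, and the sample data are hypothetical and are not taken from the CyBiasBench artifact. Shannon entropy in bits is assumed here; the paper may use a different base or normalization.

```python
import math
from collections import Counter


def family_distribution(attempts):
    """Turn a list of attack-family labels into a {family: share} mapping."""
    counts = Counter(attempts)
    total = sum(counts.values())
    return {family: n / total for family, n in counts.items()}


def shannon_entropy(dist):
    """Shannon entropy in bits. 0 means all effort goes to a single family."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def dominant_family(dist):
    """Family that receives the largest share of attempts."""
    return max(dist, key=dist.get)


# Hypothetical labels for one agent (illustrative only, not benchmark data).
attempts = ["sqli"] * 6 + ["xss"] * 2 + ["ssrf"] * 2
dist = family_distribution(attempts)
entropy_bits = shannon_entropy(dist)
top_family = dominant_family(dist)
```

Under this reading, a low entropy value indicates an agent that concentrates on few families, while an agent spreading attempts evenly over ten families would reach the maximum of log2(10) bits.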

📝 Abstract

Large language models (LLMs) are increasingly deployed as autonomous agents in offensive cybersecurity. In this paper, we reveal an interesting phenomenon: different agents exhibit distinct attack patterns. Specifically, each agent exhibits an attack-selection bias, disproportionately concentrating its efforts on a narrow subset of attack families regardless of prompt variations. To systematically quantify this behavior, we introduce CyBiasBench, a comprehensive 630-session benchmark that evaluates five agents on three targets and four prompt conditions with ten attack families. We identify explicit bias across agents, with different dominant attack families and varying entropy levels in their attack-family allocation distributions. Such bias is better characterized as a trait of the agents, rather than a factor associated with the attack success rate. Furthermore, our experiments reveal a bias momentum effect, where agents resist explicit steering toward attack families that conflict with their bias. This forced distribution shift does not yield measurable improvements in attack performance. To ensure reproducibility and facilitate future research, we release an interactive result dashboard at https://trustworthyai.co.kr/CyBiasBench/ and a reproducibility artifact with aggregated session-level statistics and full evaluation scripts at https://github.com/Harry24k/CyBiasBench.
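One way to picture the "forced distribution shift" is to compare an agent's attack-family distribution with and without steering. The abstract does not say which metric the authors use, so the sketch below substitutes total variation distance as an assumption. The function name, family labels, and numbers are hypothetical.

```python
def total_variation(p, q):
    """Total variation distance between two {family: share} distributions.

    Ranges from 0 (identical allocations) to 1 (no overlap at all).
    Families missing from one distribution are treated as share 0.
    """
    families = set(p) | set(q)
    return 0.5 * sum(abs(p.get(f, 0.0) - q.get(f, 0.0)) for f in families)


# Hypothetical allocations (illustrative only, not benchmark data).
baseline = {"sqli": 0.6, "xss": 0.2, "ssrf": 0.2}  # no steering
steered = {"sqli": 0.5, "xss": 0.2, "ssrf": 0.3}   # prompt pushes toward "ssrf"

shift = total_variation(baseline, steered)
```

In this reading, a small shift despite an explicit steering prompt would correspond to the resistance that the abstract calls bias momentum.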

Problem

Research questions and friction points this paper is trying to address.

LLM agents

attack-selection bias

cybersecurity

bias characterization

autonomous agents

Innovation

Methods, ideas, or system contributions that make the work stand out.

attack-selection bias

CyBiasBench

LLM agents