🤖 AI Summary
This study investigates the use of large language model agents (LMAs) to automate advanced persistent threat behaviors, such as lateral movement, in red teaming exercises, while addressing the reliability and governance challenges of real-world adversarial environments. Grounded in the MITRE ATT&CK framework, we evaluate LMA performance across three operational modes (fully autonomous execution, self-scaffolded planning, and expert-defined action plans) in two lateral-movement scenarios within a controlled adversary-emulation environment. Agents interact with instrumented cyber agents, observe execution artifacts, and iteratively adapt based on environmental feedback. Each scenario is formalized as an ordered task chain with explicit validation predicates, and an LLM-as-a-Judge paradigm is used to verify outcomes deterministically and to delineate LMA capability boundaries. Preliminary results show that expert-defined action plans achieve the highest completion rates, yet failures remain frequent in all modes, primarily due to brittle command invocation, environmental and deployment instability, and errors in credential and state management.
📝 Abstract
Language Model Agents (LMAs) are emerging as a powerful primitive for augmenting red-team operations. They can support attack planning, adversary emulation, and the orchestration of multi-step activity such as lateral movement, a core enabling capability of advanced persistent threat (APT) campaigns. Using frameworks such as MITRE ATT&CK, we analyze where these agents intersect with core offensive functions and assess the current strengths and limitations of LMAs, with an emphasis on governance and realistic evaluation. We benchmark LMAs on two lateral-movement scenarios in a controlled adversary-emulation environment, where LMAs interact with instrumented cyber agents, observe execution artifacts, and iteratively adapt based on environmental feedback. Each scenario is formalized as an ordered task chain with explicit validation predicates, and an LLM-as-a-Judge paradigm is used to verify outcomes deterministically.
We compare three operational modalities: fully autonomous execution, self-scaffolded planning, and expert-defined action plans. Preliminary findings indicate that expert-defined action plans yield higher task-completion rates than the other two modalities. However, failures remain frequent across all modalities, largely attributable to brittle command invocation, environmental and deployment instability, and recurring errors in credential management and state handling.
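
As an illustration of what "an ordered task chain with explicit validation predicates" could look like, the following is a minimal sketch and not the benchmark's actual implementation. The `Task` class, the `evaluate_chain` function, the task names, and the observation keys are all hypothetical, and plain boolean predicates stand in for the LLM-as-a-Judge verification described in the abstract.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical representation of what the evaluation harness has observed.
Observation = Dict[str, Any]


@dataclass(frozen=True)
class Task:
    """One step in an ordered chain, with a predicate that decides success."""
    name: str
    predicate: Callable[[Observation], bool]


def evaluate_chain(tasks: List[Task], observation: Observation) -> List[str]:
    """Return the names of tasks completed in order, stopping at the first failure."""
    completed: List[str] = []
    for task in tasks:
        if not task.predicate(observation):
            break  # later tasks depend on earlier ones, so the chain ends here
        completed.append(task.name)
    return completed


# Illustrative chain; the names and observation keys are assumptions.
chain = [
    Task("discovery", lambda obs: len(obs.get("hosts_seen", [])) > 0),
    Task("credential_access", lambda obs: bool(obs.get("credential_valid", False))),
    Task("lateral_movement", lambda obs: bool(obs.get("session_on_target", False))),
]

example_observation = {
    "hosts_seen": ["host-b"],
    "credential_valid": True,
    "session_on_target": False,
}

completed = evaluate_chain(chain, example_observation)
completion_rate = len(completed) / len(chain)
```

Because the chain is ordered, a failed predicate ends the evaluation, so the completion rate in this sketch counts only the prefix of tasks that succeeded. That is one plausible reading of a task-completion rate, and the paper may define it differently.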