🤖 AI Summary
This work addresses two limitations in LLM-driven automated intrusion-style workflows. A single LLM instance compresses evidence extraction, planning, execution, and validation into one context, which makes it prone to context drift and error propagation. Existing LLM-based multi-agent systems support general collaboration but do not explicitly model role boundaries, artifact provenance, or cost constraints. The authors propose CAESAR, a coordinated multi-agent framework that decomposes the workflow into five typed roles. The roles are coordinated through a bounded round protocol with a persistent knowledge base, a per-round workspace, validator-gated knowledge promotion, and capability-token write isolation. Evaluated on 25 CTF tasks across five categories and four LLM backends, CAESAR improves task success and reduces performance variance relative to a single-agent baseline under matched budgets and tool access, with larger gains on tasks requiring multi-step exploit composition. A secondary simulated interactional-security study suggests that the role structure can transfer beyond code-native surfaces. The authors conclude that role transitions, artifact provenance, and knowledge-promotion events provide useful structural signals for monitoring coordinated LLM-agent behavior.
📝 Abstract
Automated intrusion-style workflows require LLM agents to reason over partial observations, tool outputs, and executable artifacts under bounded budgets. A single LLM instance often compresses evidence extraction, planning, execution, and validation into one context, which increases the risk of context drift and error propagation. Existing LLM-based multi-agent systems support general collaboration, but they do not explicitly model the role boundaries, artifact provenance, and cost constraints that characterize multi-stage intrusion workflows.
This paper presents CAESAR, a coordinated multi-agent framework for controlled analysis of LLM-agent behavior in intrusion-style tasks. CAESAR decomposes the workflow into five typed roles and coordinates them through a bounded round protocol with a persistent knowledge base, a per-round workspace, validator-gated knowledge promotion, and capability-token write isolation. We evaluate CAESAR on 25 CTF tasks across five categories and four LLM backends. Compared with a single-agent baseline under matched budgets and tool access, CAESAR improves task success and reduces performance variance, with larger gains on tasks requiring multi-step exploit composition. A secondary simulated interactional-security study suggests that the role structure can transfer beyond code-native surfaces. The results indicate that role transitions, artifact provenance, and knowledge-promotion events provide useful structural signals for monitoring coordinated LLM-agent behavior beyond individual prompt and output inspection. The dataset, implementation, and evaluation logs are released at https://github.com/Xu-Qiu/CMAS.
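The coordination mechanisms named in the abstract can be illustrated with a minimal Python sketch. This is not the released CAESAR implementation: the `Workspace` class, the `run_rounds` function, the token scheme, and the role names used in the usage note are all hypothetical, chosen only to show how a bounded round loop, a per-round workspace with token-checked writes, and validator-gated promotion into a persistent knowledge base might fit together.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class Workspace:
    """Hypothetical per-round scratch space; writes need a token issued for a role."""
    tokens: dict = field(default_factory=dict)     # token -> role name
    artifacts: list = field(default_factory=list)  # (role, content) pairs

    def issue_token(self, role: str) -> str:
        token = secrets.token_hex(8)
        self.tokens[token] = role
        return token

    def write(self, token: str, content: str) -> None:
        role = self.tokens.get(token)
        if role is None:
            raise PermissionError("unknown capability token")
        # The role tied to the token is stored with the artifact as provenance.
        self.artifacts.append((role, content))


def run_rounds(agents, validate, max_rounds):
    """Run a bounded number of rounds and return the persistent knowledge base.

    agents:   dict mapping a role name to a callable(round_no, knowledge_base)
              that returns a string artifact or None (assumed interface).
    validate: callable(content) -> bool, standing in for the validator role.
    """
    knowledge_base = []  # persists across rounds: (round_no, role, content)
    for round_no in range(max_rounds):  # bounded round protocol
        workspace = Workspace()  # fresh workspace every round
        for role, act in agents.items():
            token = workspace.issue_token(role)
            output = act(round_no, knowledge_base)
            if output is not None:
                workspace.write(token, output)
        for role, content in workspace.artifacts:
            if validate(content):  # only validated artifacts are promoted
                knowledge_base.append((round_no, role, content))
    return knowledge_base
```

With two hypothetical roles, say an `"extractor"` that emits `observation-<round>` strings and a `"planner"` that emits an unverified guess, a validator that accepts only observations would leave the knowledge base holding the extractor's artifacts, each tagged with its round and originating role. Those tagged promotion records are the kind of structural signal the abstract proposes for monitoring.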