AI Summary
This study addresses the lack of systematic guidance on how context representation, deliberation mechanisms, and hierarchical task decomposition affect the performance and inference cost of compound large language model (LLM) agents in adversarial, partially observable environments, using the CybORG CAGE-2 cyber defense POMDP as the testbed. Controlled experiments across five model families, six models, and twelve configurations show that programmatic state abstraction yields the largest return per token spent, and they identify a "deliberation cascade": distributing deliberation tools across a hierarchy degrades performance relative to hierarchy alone. Programmatic state tracking improves mean return by up to 76% over raw observations, and hierarchical decomposition without deliberation achieves the best absolute performance for most models; adding deliberation to the hierarchy makes mean return up to 3.4× worse while consuming 1.8–2.7× more tokens. These findings suggest a design principle for structured adversarial POMDPs: invest in programmatic infrastructure and clean task decomposition rather than deeper per-agent reasoning.
Abstract
Deploying compound LLM agents in adversarial, partially observable sequential environments requires navigating several design dimensions: (1) what the agent sees, (2) how it reasons, and (3) how tasks are decomposed across components. Yet practitioners lack guidance on which design choices improve performance versus merely increase inference costs. We present a controlled study of compound LLM agent design in CybORG CAGE-2, a cyber defense environment modeled as a Partially Observable Markov Decision Process (POMDP). Reward is non-positive, so all configurations operate in a failure-mitigation mode. Our evaluation spans five model families, six models, and twelve configurations (3,475 episodes) with token-level cost accounting. We vary context representation (raw observations vs. a deterministic state-tracking layer with compressed history), deliberation (self-questioning, self-critique, and self-improvement tools, with optional chain-of-thought prompting), and hierarchical decomposition (monolithic ReAct vs. delegation to specialized sub-agents). We find that: (1) Programmatic state abstraction delivers the largest returns per token spent (RPTS), improving mean return by up to 76% over raw observations. (2) Distributing deliberation tools across a hierarchy degrades performance relative to hierarchy alone for all five model families, reaching up to 3.4$\times$ worse mean return while using 1.8-2.7$\times$ more tokens. We call this destructive pattern a deliberation cascade. (3) Hierarchical decomposition without deliberation achieves the best absolute performance for most models, and context engineering is generally more cost-effective than deliberation. These findings suggest a design principle for structured adversarial POMDPs: invest in programmatic infrastructure and clean task decomposition rather than deeper per-agent reasoning, as these strategies can interfere when combined.
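To make the "deterministic state-tracking layer with compressed history" concrete, the following is a minimal Python sketch of the general idea, not the authors' implementation. It assumes a hypothetical observation format (a mapping from host name to a status string); the class `StateTracker`, its fields, the status labels, and the host names are all illustrative and are not part of the CybORG API. The tracker records only status changes, keeps a bounded window of recent events, and renders a compact summary that could be placed in an agent's prompt instead of the raw observation.

```python
from dataclasses import dataclass, field


@dataclass
class StateTracker:
    """Deterministic summary of raw per-step observations (hypothetical format)."""

    max_history: int = 3  # number of recent change events to retain
    host_status: dict[str, str] = field(default_factory=dict)
    history: list[str] = field(default_factory=list)

    def update(self, step: int, observation: dict[str, str]) -> None:
        # `observation` maps host name -> status string (an assumed format).
        for host, status in sorted(observation.items()):
            if self.host_status.get(host) != status:
                # Record only changes, so repeated observations add nothing.
                self.host_status[host] = status
                self.history.append(f"t={step}: {host} -> {status}")
        # Compress history by keeping only the most recent events.
        self.history = self.history[-self.max_history:]

    def render(self) -> str:
        # Compact, order-stable text that would replace the raw observation.
        hosts = ", ".join(f"{h}={s}" for h, s in sorted(self.host_status.items()))
        events = "; ".join(self.history) if self.history else "none"
        return f"hosts: {hosts or 'none'} | recent: {events}"


tracker = StateTracker(max_history=2)
tracker.update(1, {"User1": "scanned"})
tracker.update(2, {"User1": "scanned"})  # unchanged, so no new event
tracker.update(3, {"User1": "compromised", "Op_Server0": "scanned"})
print(tracker.render())
```

Because the summary is computed by ordinary code rather than by the model, it adds no inference tokens of its own and its size stays bounded as the episode grows, which is one plausible reading of why such a layer would score well on return per token spent.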