🤖 AI Summary
This work addresses the frequent non-compliance of incident-response plans drafted by large language models (LLMs), which can violate mandatory steps, required action ordering, or approval gates. The paper proposes SOCpilot, a framework that makes the compliance of LLM-drafted plans measurable: it fixes the incident package, action catalog, policy rules, verifier, and public evidence surface, then deterministically verifies the proposed action trace at the plan boundary. The evaluation covers two LLM providers on 200 real incidents from an anonymized financial-sector security operations center (SOC), with plans compared against paired analyst-authored references from the same security orchestration, automation, and response (SOAR) cases. The deterministic verifier removes 466 non-compliant approval-gated actions without reducing baseline-task recall, and aggregate rates remain stable across three reruns of the fixed corpus. Separately, the artifact exposes zero-cost readiness checks for mandatory and ordering repairs.
📝 Abstract
Security operations centers (SOCs) are beginning to use large language models (LLMs) as copilots to draft incident-response plans. These plans may include actions that are valid per the catalog but still violate mandatory steps, required ordering, or approval gates before analyst review. SOCpilot makes this compliance question measurable at the plan boundary. It fixes the incident package, action catalog, policy rules, verifier, and public evidence surface. Next, it verifies the copilot's proposed action trace. We evaluate two LLM providers on 200 real incidents from an anonymized production SOC in a financial-sector case study. We compare their plans to paired analyst-authored references from the same security orchestration, automation, and response (SOAR) cases. An identical inline policy text moves the two providers in opposite directions. A deterministic verifier removes 466 non-compliant, approval-gated actions, without reducing baseline-task recall. Aggregate rates remain stable across 3 reruns of the fixed corpus. The official evidence focuses on approval-gated decisions regarding recovery and containment. Separately, the artifact exposes zero-cost readiness checks for mandatory and ordering repairs. We release the runnable artifact so independent reviewers can rederive the public results without access to private incident data.
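To make the three compliance dimensions named above concrete (mandatory steps, required ordering, approval gates), the following is a minimal sketch of what a deterministic check on a proposed action trace could look like. It is not SOCpilot's implementation: the `Policy` and `Verdict` types, the `verify` function, and every action name are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy rules; not SOCpilot's actual schema."""
    catalog: frozenset         # actions the copilot is allowed to propose
    mandatory: frozenset       # actions every plan must contain
    ordering: tuple            # (before, after) pairs that must hold
    approval_gated: frozenset  # actions that require analyst approval


@dataclass
class Verdict:
    kept: list = field(default_factory=list)
    removed: list = field(default_factory=list)
    missing_mandatory: list = field(default_factory=list)
    ordering_violations: list = field(default_factory=list)


def verify(trace, policy, approved=frozenset()):
    """Check a proposed action trace; the same inputs always give the same verdict."""
    verdict = Verdict()
    for action in trace:
        if action not in policy.catalog:
            verdict.removed.append(action)   # not a catalog action
        elif action in policy.approval_gated and action not in approved:
            verdict.removed.append(action)   # gated, but no approval on record
        else:
            verdict.kept.append(action)

    # Mandatory steps that are absent from the surviving plan.
    verdict.missing_mandatory = sorted(policy.mandatory - set(verdict.kept))

    # Ordering is checked on the first occurrence of each surviving action.
    position = {}
    for index, action in enumerate(verdict.kept):
        position.setdefault(action, index)
    for before, after in policy.ordering:
        if before in position and after in position:
            if position[before] > position[after]:
                verdict.ordering_violations.append((before, after))
    return verdict


# Illustrative data only; these action names are invented for the example.
policy = Policy(
    catalog=frozenset({"collect_logs", "isolate_host",
                       "reset_credentials", "restore_service"}),
    mandatory=frozenset({"collect_logs"}),
    ordering=(("collect_logs", "isolate_host"),),
    approval_gated=frozenset({"isolate_host", "restore_service"}),
)
trace = ["isolate_host", "collect_logs", "restore_service", "reset_credentials"]
verdict = verify(trace, policy, approved=frozenset({"isolate_host"}))
```

In this sketch the unapproved `restore_service` action is removed, while the ordering problem (`isolate_host` proposed before `collect_logs`) is only reported. That loosely mirrors the abstract's distinction between the verifier removing approval-gated actions and the separate readiness checks for mandatory and ordering repairs, though how the actual artifact divides that work is not specified here.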