🤖 AI Summary
This work addresses the asymmetric control problem in agentic security systems: strong offensive capability must be preserved inside an authorized engagement and denied outside it. The paper introduces alignment contracts, a framework that models behavioral boundaries as formal contracts over observable effect traces. A contract specifies scope, allowed and forbidden effects, resource budgets, and disclosure policies. The language has finite-trace semantics, satisfaction is characterized as a safety property with finite violation witnesses, and refinement and one-way composition rules support modular contract engineering. Admissibility checking is shown to be decidable, and a Lean 4 artifact checks the core theorems. An instantiation for web-focused security workflows establishes enforcement soundness for monitor-realized traces under an explicit Effect Observability Assumption, and the same structure extends to other effect profiles. The limits of the approach are formalized through undecidability-transfer and observability-boundary theorems.
📝 Abstract
Agentic security systems increasingly combine LLM planners with tools that can discover, validate, and report vulnerabilities. This creates an asymmetric control problem: the system should retain strong offensive capability inside an authorized engagement, while the same capabilities must be denied outside scope. Existing guardrails provide useful policy controls, but they do not make this boundary a first-class formal contract over observable effects.
We introduce alignment contracts, a framework for specifying and enforcing behavioral constraints over observable effect traces. A contract defines scope, allowed and forbidden effects, resource budgets, and disclosure policies. We give the language finite-trace semantics, characterize satisfaction as a safety property with finite violation witnesses, develop refinement and one-way composition rules for modular contract engineering, and show that admissibility checking is decidable. We instantiate the framework for web-focused agentic security workflows and show how the same structure extends to other effect profiles.
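To make the contract structure concrete, the following is a minimal Python sketch and not the paper's formal language. The names `Effect`, `Contract`, and `first_violation`, the per-effect `cost` field, and the example hosts are all assumptions introduced for illustration, and disclosure policies are omitted for brevity. It shows why satisfaction behaves as a safety property: a violation is witnessed by a finite prefix, and extending that prefix cannot repair it.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Effect:
    """One observable effect (hypothetical encoding)."""
    kind: str      # e.g. "scan", "probe"
    target: str    # host the effect touches
    cost: int = 1  # units drawn from the resource budget


@dataclass(frozen=True)
class Contract:
    """Illustrative contract: scope, allowed/forbidden effects, budget."""
    scope: frozenset      # hosts inside the authorized engagement
    allowed: frozenset    # effect kinds permitted in scope
    forbidden: frozenset  # effect kinds never permitted
    budget: int           # total cost a trace may consume


def first_violation(contract: Contract, trace: Sequence[Effect]) -> Optional[int]:
    """Return the index of the first violating effect, or None if the
    finite trace satisfies the contract. The prefix ending at that index
    is a finite violation witness."""
    spent = 0
    for i, eff in enumerate(trace):
        spent += eff.cost
        if (eff.target not in contract.scope
                or eff.kind in contract.forbidden
                or eff.kind not in contract.allowed
                or spent > contract.budget):
            return i
    return None


contract = Contract(
    scope=frozenset({"app.example.test"}),
    allowed=frozenset({"scan", "probe"}),
    forbidden=frozenset({"exfiltrate"}),
    budget=3,
)
ok_trace = [Effect("scan", "app.example.test"), Effect("probe", "app.example.test")]
bad_trace = ok_trace + [Effect("scan", "other.example.test")]
```

In this sketch the check is a single pass over a finite trace, so it always terminates. That is only an informal analogue of a decidable check; the paper's admissibility result concerns its own contract language.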
Under an explicit Effect Observability Assumption, where all $\Sigma_{\mathrm{Eff}}$-effects are mediated, the soundness theorem quantifies over the agent model and gives guarantees for mediated $\Sigma_{\mathrm{Eff}}$-effects, including enforcement soundness for monitor-realized traces. We also state an assumption-lifted adaptation result and formalize limits through undecidability transfer and observability-boundary theorems. A Lean 4 artifact checks the formal core theorems used by the paper.
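The role of mediation can be illustrated with a small Python sketch of a runtime monitor. This is an assumption-laden illustration, not the paper's construction: `run_monitored`, `permits`, the tuple encoding of effects, and the choice to block a rejected effect and continue (instead of halting the agent) are all hypothetical. The point is that the guarantee covers only the realized trace, meaning the effects that actually passed through the monitor, and it holds regardless of what the agent proposes.

```python
from typing import Callable, Iterable, List, Tuple

Effect = Tuple[str, str]  # (kind, target host); a hypothetical encoding


def run_monitored(
    permits: Callable[[List[Effect], Effect], bool],
    proposed: Iterable[Effect],
) -> List[Effect]:
    """Mediate every proposed effect. An effect is executed (appended to the
    realized trace) only if the check accepts it given the prefix so far."""
    realized: List[Effect] = []
    for eff in proposed:
        if permits(realized, eff):
            realized.append(eff)
        # A rejected effect is never executed, so it never enters the trace.
    return realized


IN_SCOPE = {"app.example.test"}  # illustrative authorized scope
MAX_EFFECTS = 2                  # illustrative resource budget


def permits(prefix: List[Effect], eff: Effect) -> bool:
    """Accept an effect only if it is in scope and the budget is not exhausted."""
    return eff[1] in IN_SCOPE and len(prefix) < MAX_EFFECTS


# The agent model is arbitrary: it may propose out-of-scope or excess effects.
proposed = [
    ("scan", "app.example.test"),
    ("scan", "out-of-scope.example.test"),
    ("probe", "app.example.test"),
    ("probe", "app.example.test"),
]
realized = run_monitored(permits, proposed)
```

If the agent could cause an effect without going through `run_monitored`, nothing in this sketch would constrain it, which is the informal intuition behind treating observability as an explicit assumption.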