Proof-of-Guardrail in AI Agents and What (Not) to Trust from It

📅 2026-03-06
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge that AI agent developers may falsely claim to have implemented safety measures, while users lack any practical means to verify whether protective mechanisms are genuinely enforced. To bridge this trust gap, the paper proposes a verifiable "proof of protection" mechanism grounded in Trusted Execution Environments (TEEs). By leveraging remote attestation and cryptographic proofs, the approach guarantees that AI responses were indeed processed by specified open-source safety rules, while preserving model privacy. The authors implement a prototype on the OpenClaw agent, demonstrating feasibility, quantifying latency and deployment overhead, and uncovering potential adversarial bypass risks. This study presents the first system enabling offline verifiability of safety-policy enforcement, balancing privacy preservation with trustworthy execution.

๐Ÿ“ Abstract
As AI agents become widely deployed as online services, users often rely on an agent developer's claims about how safety is enforced, which introduces a threat where safety measures are falsely advertised. To address this threat, we propose proof-of-guardrail, a system that enables developers to provide cryptographic proof that a response was generated after passing through a specific open-source guardrail. To generate the proof, the developer runs the agent and guardrail inside a Trusted Execution Environment (TEE), which produces a TEE-signed attestation of guardrail code execution that any user can verify offline. We implement proof-of-guardrail for OpenClaw agents and evaluate its latency overhead and deployment cost. Proof-of-guardrail ensures the integrity of guardrail execution while keeping the developer's agent private, but we also highlight a residual risk of deception about safety, for example, when malicious developers actively jailbreak the guardrail. Code and demo video: https://github.com/SaharaLabsAI/Verifiable-ClawGuard
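The flow the abstract describes, namely measuring the open-source guardrail code, binding that measurement to the agent's response inside the TEE, and letting a user check the proof offline, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and a stdlib HMAC stands in for the TEE's asymmetric attestation signature (a real attestation would chain to the TEE vendor's root of trust).

```python
import hashlib
import hmac

def measure(code: bytes) -> str:
    """Measurement of the published open-source guardrail code (SHA-256)."""
    return hashlib.sha256(code).hexdigest()

def sign_proof(tee_key: bytes, guardrail_code: bytes, response: bytes) -> str:
    """Inside the TEE: bind the guardrail measurement to the response hash.
    HMAC is a runnable stand-in for the TEE-signed attestation."""
    payload = (measure(guardrail_code)
               + hashlib.sha256(response).hexdigest()).encode()
    return hmac.new(tee_key, payload, hashlib.sha256).hexdigest()

def verify_proof(tee_key: bytes, guardrail_code: bytes,
                 response: bytes, proof: str) -> bool:
    """On the user side, offline: recompute the binding and compare.
    The user needs only the guardrail source, the response, and the proof."""
    payload = (measure(guardrail_code)
               + hashlib.sha256(response).hexdigest()).encode()
    expected = hmac.new(tee_key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, proof)

# Demo with placeholder values.
key = b"attestation-key"          # stands in for the TEE signing key
code = b"def guardrail(x): ..."   # the open-source guardrail the developer claims to run
resp = b"agent response"

proof = sign_proof(key, code, resp)
print(verify_proof(key, code, resp, proof))           # genuine proof: True
print(verify_proof(key, b"other code", resp, proof))  # different guardrail: False
```

The key property mirrored here is that the proof is useless unless the response really passed through the exact guardrail code the user can inspect: changing either the guardrail bytes or the response invalidates the verification.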
Problem

Research questions and friction points this paper is trying to address.

AI safety
trust
guardrail verification
misleading claims
agent deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proof-of-Guardrail
Trusted Execution Environment
AI safety
verifiable execution
guardrail integrity