Proof-of-Guardrail in AI Agents and What (Not) to Trust from It

📅 2026-03-06
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge that AI agent developers may falsely claim to have implemented safety measures, while users lack any practical means to verify whether protective mechanisms are genuinely enforced. To bridge this trust gap, the paper proposes a verifiable "proof of protection" mechanism grounded in Trusted Execution Environments (TEEs). By leveraging remote attestation and cryptographic proofs, the approach guarantees that AI responses were indeed processed by specified open-source safety rules, while preserving model privacy. The authors implement a prototype on the OpenClaw agent, demonstrating feasibility, quantifying latency and deployment overhead, and uncovering potential adversarial bypass risks. This study presents the first system enabling offline verifiability of safety-policy enforcement, balancing privacy preservation with trustworthy execution.

๐Ÿ“ Abstract
As AI agents become widely deployed as online services, users often rely on an agent developer's claims about how safety is enforced, which introduces a threat where safety measures are falsely advertised. To address this threat, we propose proof-of-guardrail, a system that enables developers to provide cryptographic proof that a response was generated after passing through a specific open-source guardrail. To generate the proof, the developer runs the agent and guardrail inside a Trusted Execution Environment (TEE), which produces a TEE-signed attestation of guardrail code execution that any user can verify offline. We implement proof-of-guardrail for OpenClaw agents and evaluate its latency overhead and deployment cost. Proof-of-guardrail ensures the integrity of guardrail execution while keeping the developer's agent private, but we also highlight a residual risk of deception about safety, for example, when malicious developers actively jailbreak the guardrail. Code and demo video: https://github.com/SaharaLabsAI/Verifiable-ClawGuard
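The flow the abstract describes, namely measuring the open-source guardrail code, binding that measurement to the agent's response inside the TEE, and letting a user check the proof offline, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and a stdlib HMAC stands in for the TEE's asymmetric attestation signature (a real attestation would chain to the TEE vendor's root of trust).

```python
import hashlib
import hmac

def measure(code: bytes) -> str:
    """Measurement of the published open-source guardrail code (SHA-256)."""
    return hashlib.sha256(code).hexdigest()

def sign_proof(tee_key: bytes, guardrail_code: bytes, response: bytes) -> str:
    """Inside the TEE: bind the guardrail measurement to the response hash.
    HMAC is a runnable stand-in for the TEE-signed attestation."""
    payload = (measure(guardrail_code)
               + hashlib.sha256(response).hexdigest()).encode()
    return hmac.new(tee_key, payload, hashlib.sha256).hexdigest()

def verify_proof(tee_key: bytes, guardrail_code: bytes,
                 response: bytes, proof: str) -> bool:
    """On the user side, offline: recompute the binding and compare.
    The user needs only the guardrail source, the response, and the proof."""
    payload = (measure(guardrail_code)
               + hashlib.sha256(response).hexdigest()).encode()
    expected = hmac.new(tee_key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, proof)

# Demo with placeholder values.
key = b"attestation-key"          # stands in for the TEE signing key
code = b"def guardrail(x): ..."   # the open-source guardrail the developer claims to run
resp = b"agent response"

proof = sign_proof(key, code, resp)
print(verify_proof(key, code, resp, proof))           # genuine proof: True
print(verify_proof(key, b"other code", resp, proof))  # different guardrail: False
```

The key property mirrored here is that the proof is useless unless the response really passed through the exact guardrail code the user can inspect: changing either the guardrail bytes or the response invalidates the verification.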
Problem

Research questions and friction points this paper is trying to address.

AI safety
trust
guardrail verification
misleading claims
agent deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proof-of-Guardrail
Trusted Execution Environment
AI safety
verifiable execution
guardrail integrity