🤖 AI Summary
Existing AI safety benchmarks do not produce verifiable results, and they struggle to protect both model intellectual property and benchmark-dataset confidentiality, particularly when model providers and auditors do not trust each other. This poses serious privacy and trust risks. This paper proposes the first verifiable AI safety benchmarking framework that integrates remote attestation with privacy-preserving computation. Leveraging Trusted Execution Environments (TEEs), it provides encrypted, isolated execution of model and data, secure multi-party interaction, and tamper-proof auditing. Crucially, it embeds TEE-based remote attestation into the AI auditing feedback loop, enabling external stakeholders to independently verify system compliance. A prototype implementation on Llama-3.1 demonstrates end-to-end verifiable evaluation on standard safety benchmarks, including HarmBench and AdvBench, achieving strong integrity guarantees without exposing sensitive assets. The framework strengthens the trust foundations and practical feasibility of cross-organizational AI governance.
📝 Abstract
Benchmarks are an important means of evaluating the safety and compliance of AI models at scale. However, they typically do not offer verifiable results and lack confidentiality for model IP and benchmark datasets. We propose Attestable Audits, which run inside Trusted Execution Environments and enable users to verify that they are interacting with a compliant AI model. Our work protects sensitive data even when the model provider and the auditor do not trust each other, addressing verification challenges raised in recent AI governance frameworks. We build a prototype demonstrating feasibility on typical audit benchmarks with Llama-3.1.
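The attestation flow described above can be sketched in miniature: the enclave produces a measurement binding the loaded model and audit code together, and the auditor checks that measurement against an expected value before trusting any benchmark results. Everything here is illustrative (the function names, the HMAC-based "quote"); a real deployment would rely on hardware-backed quotes such as Intel SGX/TDX attestation reports verified against the vendor's root of trust, not a shared software key.

```python
import hashlib
import hmac
import secrets

# Stand-in for a hardware-fused secret that only the enclave can use.
# (Illustrative only; real TEEs never expose such a key to software.)
HW_ROOT_KEY = secrets.token_bytes(32)

def enclave_quote(model_weights: bytes, audit_code: bytes) -> dict:
    """Produce a toy 'attestation quote' binding model and audit-code identity."""
    measurement = hashlib.sha256(model_weights + audit_code).hexdigest()
    signature = hmac.new(HW_ROOT_KEY, measurement.encode(), hashlib.sha256).hexdigest()
    return {"measurement": measurement, "signature": signature}

def auditor_verify(quote: dict, expected_measurement: str) -> bool:
    """Auditor-side check: the measurement matches and the signature is valid."""
    if quote["measurement"] != expected_measurement:
        return False
    expected_sig = hmac.new(HW_ROOT_KEY, quote["measurement"].encode(),
                            hashlib.sha256).hexdigest()
    return hmac.compare_digest(quote["signature"], expected_sig)
```

In this toy model, a tampered model or modified benchmark harness changes the measurement, so verification fails and the auditor refuses to accept the results; that refusal-on-mismatch is the property the paper's remote-attestation step provides with hardware guarantees.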