🤖 AI Summary
Existing AI safety benchmarks do not produce verifiable results, and they struggle to protect both model intellectual property and benchmark-dataset confidentiality, particularly when model providers and auditors do not trust each other. This poses serious privacy and trust risks. This paper proposes the first verifiable AI safety benchmarking framework that integrates remote attestation with privacy-preserving computation. Leveraging Trusted Execution Environments (TEEs), it provides encrypted, isolated execution of model and data, secure multi-party interaction, and tamper-proof auditing. Crucially, it embeds TEE-based remote attestation into the AI auditing feedback loop, enabling external stakeholders to independently verify system compliance. A prototype implementation on Llama-3.1 demonstrates end-to-end verifiable evaluation on standard safety benchmarks, including HarmBench and AdvBench, achieving strong integrity guarantees without exposing sensitive assets. The framework strengthens the trust foundations and practical feasibility of cross-organizational AI governance.
📝 Abstract
Benchmarks are an important means of evaluating the safety and compliance of AI models at scale. However, they typically do not offer verifiable results and lack confidentiality for model IP and benchmark datasets. We propose Attestable Audits, which run inside Trusted Execution Environments and enable users to verify that they are interacting with a compliant AI model. Our work protects sensitive data even when the model provider and the auditor do not trust each other, addressing verification challenges raised in recent AI governance frameworks. We build a prototype demonstrating feasibility on typical audit benchmarks with Llama-3.1.
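The attestation flow described above can be sketched in miniature: the enclave produces a measurement binding the loaded model and audit code together, and the auditor checks that measurement against an expected value before trusting any benchmark results. Everything here is illustrative (the function names, the HMAC-based "quote"); a real deployment would rely on hardware-backed quotes such as Intel SGX/TDX attestation reports verified against the vendor's root of trust, not a shared software key.

```python
import hashlib
import hmac
import secrets

# Stand-in for a hardware-fused secret that only the enclave can use.
# (Illustrative only; real TEEs never expose such a key to software.)
HW_ROOT_KEY = secrets.token_bytes(32)

def enclave_quote(model_weights: bytes, audit_code: bytes) -> dict:
    """Produce a toy 'attestation quote' binding model and audit-code identity."""
    measurement = hashlib.sha256(model_weights + audit_code).hexdigest()
    signature = hmac.new(HW_ROOT_KEY, measurement.encode(), hashlib.sha256).hexdigest()
    return {"measurement": measurement, "signature": signature}

def auditor_verify(quote: dict, expected_measurement: str) -> bool:
    """Auditor-side check: the measurement matches and the signature is valid."""
    if quote["measurement"] != expected_measurement:
        return False
    expected_sig = hmac.new(HW_ROOT_KEY, quote["measurement"].encode(),
                            hashlib.sha256).hexdigest()
    return hmac.compare_digest(quote["signature"], expected_sig)
```

In this toy model, a tampered model or modified benchmark harness changes the measurement, so verification fails and the auditor refuses to accept the results; that refusal-on-mismatch is the property the paper's remote-attestation step provides with hardware guarantees.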