Expanding External Access To Frontier AI Models For Dangerous Capability Evaluations

📅 2026-01-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses three limitations in evaluating dangerous capabilities of frontier AI models (insufficient model access, limited information, and short timeframes) that undermine the rigour and credibility of assessments. To remedy this, the paper proposes a structured access framework that distinguishes three access modalities (black-box, grey-box, and white-box) along three dimensions: model access, model information, and evaluation timeframe. It further introduces AL1–AL3, a descriptive three-tier taxonomy of access levels, establishing a standardized classification of access for dangerous capability evaluations and bridging a gap between policy and practice. Informed by risk–benefit considerations and safety mechanisms from other industries, it offers an operational interpretation of the "appropriate access" referenced in the EU General-Purpose AI Code of Practice, thereby improving assessment quality, reducing false negatives, and fostering stakeholder trust.

📝 Abstract
Frontier AI companies increasingly rely on external evaluations to assess risks from dangerous capabilities before deployment. However, external evaluators often receive limited model access, limited information, and little time, which can reduce evaluation rigour and confidence. The EU General-Purpose AI Code of Practice calls for "appropriate access", but does not specify what this means in practice. Furthermore, there is no common framework for describing different types and levels of evaluator access. To address this gap, we propose a taxonomy of access methods for dangerous capability evaluations. We disentangle three aspects of access: model access, model information, and evaluation timeframe. For each aspect, we review benefits and risks, including how expanding access can reduce false negatives and improve stakeholder trust, but can also increase security and capacity challenges. We argue that these limitations can likely be mitigated through technical means and safeguards used in other industries. Based on the taxonomy, we propose three descriptive access levels: AL1 (black-box model access and minimal information), AL2 (grey-box model access and substantial information), and AL3 (white-box model access and comprehensive information), to support clearer communication between evaluators, frontier AI companies, and policymakers. We believe these levels correspond to the different standards for appropriate access defined in the EU Code of Practice, though these standards may change over time.
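The abstract's AL1–AL3 levels pair a model-access modality with an information tier. As a rough illustration only (the paper defines these levels descriptively; the class and field names below are assumptions, not the authors' formalism), the taxonomy could be sketched as a small data structure:

```python
from dataclasses import dataclass
from enum import Enum

class ModelAccess(Enum):
    # The three access modalities from the abstract.
    BLACK_BOX = "black-box"   # input/output access only
    GREY_BOX = "grey-box"     # partial access to internals
    WHITE_BOX = "white-box"   # full access to weights and internals

@dataclass(frozen=True)
class AccessLevel:
    name: str                 # e.g. "AL1"
    model_access: ModelAccess
    information: str          # how much model information evaluators receive

# The three descriptive access levels proposed in the abstract.
ACCESS_LEVELS = [
    AccessLevel("AL1", ModelAccess.BLACK_BOX, "minimal"),
    AccessLevel("AL2", ModelAccess.GREY_BOX, "substantial"),
    AccessLevel("AL3", ModelAccess.WHITE_BOX, "comprehensive"),
]

for level in ACCESS_LEVELS:
    print(f"{level.name}: {level.model_access.value} access, "
          f"{level.information} information")
```

Note that the paper treats evaluation timeframe as a third, separate dimension of access; it is omitted from this sketch for brevity.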
Problem

Research questions and friction points this paper is trying to address.

dangerous capability evaluations
external access
AI model evaluation
access taxonomy
frontier AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

access taxonomy
dangerous capability evaluation
frontier AI governance
model transparency
evaluation framework
👥 Authors

Jacob Charnock, ERA Cambridge
Alejandro Tlaie, Pour Demain
Kyle O'Brien, ERA Cambridge
Stephen Casper, PhD student, MIT (AI safety, AI responsibility, red-teaming, robustness, auditing)
Aidan Homewood, GovAI