AIReg-Bench: Benchmarking Language Models That Assess AI Regulation Compliance

📅 2025-10-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
No benchmark currently exists for evaluating large language models' (LLMs') ability to assess AI regulatory compliance, particularly under the EU Artificial Intelligence Act (AI Act). Method: We introduce AIReg-Bench, the first dedicated benchmark for AI Act compliance evaluation, comprising 120 technically nuanced documentation excerpts generated by LLMs and rigorously annotated by legal experts. It covers core high-risk AI system provisions. The data construction pipeline combines structured prompt engineering with expert legal validation, enabling reproducible multi-dimensional evaluation, including accuracy, clause coverage, and reasoning fidelity. Contribution/Results: Experiments reveal substantial limitations in state-of-the-art LLMs' compliance judgment (mean accuracy: 58.3%), exposing critical gaps in legal reasoning. AIReg-Bench fills a key gap in AI governance evaluation infrastructure; the dataset and evaluation code are publicly released to advance research on trustworthy, regulation-aware AI systems.

📝 Abstract
As governments move to regulate AI, there is growing interest in using Large Language Models (LLMs) to assess whether or not an AI system complies with a given AI Regulation (AIR). However, there is presently no way to benchmark the performance of LLMs at this task. To fill this void, we introduce AIReg-Bench: the first benchmark dataset designed to test how well LLMs can assess compliance with the EU AI Act (AIA). We created this dataset through a two-step process: (1) by prompting an LLM with carefully structured instructions, we generated 120 technical documentation excerpts (samples), each depicting a fictional, albeit plausible, AI system - of the kind an AI provider might produce to demonstrate their compliance with AIR; (2) legal experts then reviewed and annotated each sample to indicate whether, and in what way, the AI system described therein violates specific Articles of the AIA. The resulting dataset, together with our evaluation of whether frontier LLMs can reproduce the experts' compliance labels, provides a starting point to understand the opportunities and limitations of LLM-based AIR compliance assessment tools and establishes a benchmark against which subsequent LLMs can be compared. The dataset and evaluation code are available at https://github.com/camlsys/aireg-bench.
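The core evaluation described in the abstract is whether an LLM can reproduce the legal experts' per-sample compliance labels. A minimal sketch of that comparison is below; the function name and the label schema (`"compliant"` / `"violation"`, sample-id keys) are illustrative assumptions, not the actual AIReg-Bench data format.

```python
# Hypothetical sketch of the evaluation step: comparing an LLM's
# per-sample compliance labels against the legal experts' annotations.
# Label values and dict layout are assumptions for illustration only.

def compliance_accuracy(expert_labels, model_labels):
    """Fraction of samples where the model reproduces the expert label.

    Both arguments map sample_id -> label. Only sample ids present in
    the expert annotations are scored; a missing model label counts
    as a mismatch.
    """
    if not expert_labels:
        raise ValueError("no expert annotations provided")
    matches = sum(
        1 for sid, label in expert_labels.items()
        if model_labels.get(sid) == label
    )
    return matches / len(expert_labels)

# Toy example with four fictional samples:
experts = {1: "violation", 2: "compliant", 3: "violation", 4: "compliant"}
model = {1: "violation", 2: "violation", 3: "violation", 4: "compliant"}
print(compliance_accuracy(experts, model))  # 3 of 4 labels match -> 0.75
```

A fuller evaluation along the summary's other axes (clause coverage, reasoning fidelity) would compare the specific AIA Articles the model cites against those the experts flagged, rather than just the top-level label.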
Problem

Research questions and friction points this paper is trying to address.

Benchmarking LLMs for AI regulation compliance assessment
Creating dataset to test LLM performance on EU AI Act
Evaluating limitations of LLM-based AI regulation compliance tools
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created benchmark dataset for AI regulation compliance
Generated fictional AI system documentation using LLMs
Annotated compliance violations with legal expert review
👥 Authors
Bill Marino
PhD student, University of Cambridge
machine learning
Rosco Hunter
University of Warwick
AI safety and efficiency
Zubair Jamali
Marinos Emmanouil Kalpakos
University of Luxembourg
Mudra Kashyap
Isaiah Hinton
Alexa Hanson
Maahum Nazir
Christoph Schnabl
Felix Steffek
Professor of Law, University of Cambridge
Law
Hongkai Wen
University of Warwick
Machine Learning, ML/AI Systems, Cyber-Physical Systems
Nicholas D. Lane
University of Cambridge