When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

📅 2026-05-07

🤖 AI Summary
This work addresses how to compare the safety of large language models when no labeled benchmark exists for the relevant language, domain, or regulatory context. The authors propose a label-free comparative safety scoring framework in which scores are valid only under a fixed scenario pack, rubric, auditor, judge, sampling configuration, and rerun budget. In place of ground-truth agreement, an instrumental-validity chain checks three things: that the instrument separates safe from abliterated targets (AUROC), that target identity rather than auditor or judge artifacts dominates outcome variance (η²), and that results are stable across reruns. Instantiated in the scenario-driven auditing tool SimpleAudit and evaluated on a Norwegian safety scenario pack, the framework separates safe and abliterated targets with AUROCs of 0.89–1.00, attributes approximately 52% of outcome variance to the target model, and shows severity profiles stabilizing by ten reruns. A Norwegian public-sector procurement case comparing Borealis and Gemma 3 shows that which model is safer depends on scenario category and risk measure, so safety scores must be reported alongside their full audit configuration rather than reduced to a single ranking.
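To make the AUROC check concrete, here is a minimal sketch of how separation between a safe and an abliterated target could be measured from per-scenario severity scores. The function name, the example scores, and the convention that a higher score means a more severe outcome are all assumptions for illustration; the summary does not specify how the authors compute the statistic.

```python
def auroc(safe_scores, abliterated_scores):
    """Probability that a randomly chosen abliterated-target score exceeds
    a randomly chosen safe-target score, counting ties as one half.

    Assumes higher score = more severe outcome (illustrative convention).
    """
    if not safe_scores or not abliterated_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for a in abliterated_scores:
        for s in safe_scores:
            if a > s:
                wins += 1.0
            elif a == s:
                wins += 0.5
    return wins / (len(safe_scores) * len(abliterated_scores))


# Hypothetical severity scores (0 = no issue, 4 = critical), one per scenario.
safe = [0, 1, 1, 2]
abliterated = [2, 3, 3, 4]
separation = auroc(safe, abliterated)
```

A value near 1.0 means the instrument reliably scores the abliterated target as more severe, and a value near 0.5 means it cannot tell the two apart.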
📝 Abstract
Many deployments must compare candidate language models for safety before a labeled benchmark exists for the relevant language, sector, or regulatory regime. We formalize this setting as benchmarkless comparative safety scoring and specify the contract under which a scenario-based audit can be interpreted as deployment evidence. Scores are valid only under a fixed scenario pack, rubric, auditor, judge, sampling configuration, and rerun budget. Because no labels are available, we replace ground-truth agreement with an instrumental-validity chain: responsiveness to a controlled safe-versus-abliterated contrast, dominance of target-driven variance over auditor and judge artifacts, and stability across reruns. We instantiate the chain in SimpleAudit, a local-first scoring instrument, and validate it on a Norwegian safety pack. Safe and abliterated targets separate with AUROC values between 0.89 and 1.00, target identity is the dominant variance component ($\eta^2 \approx 0.52$), and severity profiles stabilize by ten reruns. Applying the same chain to Petri shows that it admits both tools. The substantial differences arise upstream of the chain, in claim-contract enforcement and deployment fit. A Norwegian public-sector procurement case comparing Borealis and Gemma 3 demonstrates the resulting evidence in practice: the safer model depends on scenario category and risk measure. Consequently, scores, matched deltas, critical rates, uncertainty, and the auditor and judge used must be reported together rather than collapsed into a single ranking.
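The variance-dominance step can be illustrated with a main-effect η² computed per factor, that is, the share of total score variance explained by grouping on target, auditor, or judge. The sketch below assumes a balanced design and uses hypothetical record fields (`target`, `judge`, `score`) and made-up scores; it is not the authors' analysis code, and their decomposition may differ in detail.

```python
from collections import defaultdict


def eta_squared(records, factor):
    """Main-effect eta squared for one factor: SS_factor / SS_total.

    `records` is a list of dicts, each with a numeric "score" and the
    named factor as a key (hypothetical schema for illustration).
    """
    scores = [r["score"] for r in records]
    grand_mean = sum(scores) / len(scores)
    ss_total = sum((x - grand_mean) ** 2 for x in scores)
    if ss_total == 0:
        raise ValueError("scores have zero variance")

    # Group scores by the level of the chosen factor.
    groups = defaultdict(list)
    for r in records:
        groups[r[factor]].append(r["score"])

    # Between-level sum of squares, weighted by group size.
    ss_factor = sum(
        len(v) * (sum(v) / len(v) - grand_mean) ** 2 for v in groups.values()
    )
    return ss_factor / ss_total


# Hypothetical audit runs: two targets crossed with two judges.
runs = [
    {"target": "A", "judge": "J1", "score": 1.0},
    {"target": "A", "judge": "J2", "score": 2.0},
    {"target": "B", "judge": "J1", "score": 3.0},
    {"target": "B", "judge": "J2", "score": 4.0},
]
eta_target = eta_squared(runs, "target")
eta_judge = eta_squared(runs, "judge")
```

In this toy data the target factor explains most of the variance and the judge factor a smaller share, which is the pattern the chain looks for before a score is read as evidence about the model rather than about the instrument.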
Problem

Research questions and friction points this paper is trying to address.

benchmarkless evaluation
LLM safety
comparative scoring
ground-truth labels
scenario-based audit
Innovation

Methods, ideas, or system contributions that make the work stand out.

benchmarkless evaluation
comparative safety scoring
instrumental validity
scenario-based audit
LLM safety
Sushant Gautam
Simula Metropolitan Center for Digital Engineering, Oslo Metropolitan University
Finn Schwall
Oslo Metropolitan University, Simula Research Laboratory
Annika Willoch Olstad
Oslo Metropolitan University, Simula Research Laboratory
Fernando Vallecillos Ruiz
University of Oslo, Simula Research Laboratory
Birk Torpmann-Hagen
Simula Research Laboratory
Sunniva Maria Stordal Bjørklund
Norwegian Directorate of Health
Leon Moonen
Full Professor, Simula Research Laboratory & BI Norwegian Business School
software engineering, software security, software analytics, data mining & machine learning, reverse
Klas Pettersen
Simula Metropolitan Center for Digital Engineering
Michael A. Riegler
Simula Metropolitan Center for Digital Engineering, Oslo Metropolitan University