AI Summary
This study addresses the lack of objective, computable metrics for quantifying the ethical and legal compliance of autonomous systems, a gap that has made such systems hard to evaluate and has hindered accountability. The authors propose the Reference Ethical Benchmark for Autonomy Readiness (REBAR), a test and evaluation framework that maps a system's operating metrics onto an interpretable "Autonomy Readiness Level" (ARL) rubric. It combines a neuro-symbolic large language model (LLM) approach for calculating and explaining the ethical difficulty of scenarios, LLM-driven generation of test instances at scale, and a photorealistic simulation environment. Evaluating white-box autonomy solutions through this pipeline yields an objective and repeatable benchmark score for ethical performance, linking abstract ethical principles to verifiable, accountable behavior.
Abstract
As autonomous systems grow more advanced, objective metrics for evaluating their ethical and legal compliance are critical for informing end users of their limitations and for holding accountable those who misuse them. Current ethical embodied AI frameworks remain mostly qualitative and focus on system design (through safety guardrails or targeted red teaming); the resulting guardrails often disallow unsafe behavior outright, without giving the user an override or an interpretable reason. What is needed instead are computable metrics, derived through rigorous testing, that let a user determine whether the system is applicable to the task. To address this gap, we introduce the Reference Ethical Benchmark for Autonomy Readiness (REBAR), a quantitative test and evaluation framework for autonomous systems. REBAR maps operating metrics onto a computable Autonomy Readiness Level (ARL) rubric that can quantify ethical performance. Key innovations of the framework include a neuro-symbolic Large Language Model (LLM) approach to calculate and explain the ethical difficulty of scenarios, LLM-driven generation of test instances at scale, and a versatile, photorealistic simulation environment. By evaluating white-box autonomy solutions through this rigorous testing pipeline, REBAR delivers an objective and repeatable benchmark score, bridging the gap between abstract principles and verifiable, accountable autonomy.