🤖 AI Summary
Existing autonomous driving benchmarks struggle to jointly model multi-objective priority rules and formally specified scenarios, limiting their ability to assess how a system balances safety, rule compliance, and efficiency in complex traffic situations. This work proposes ScenicRules, an evaluation framework that integrates prioritized multi-objective rules with formally modeled driving scenarios. The framework uses the Scenic language to construct a compact yet diverse set of driving scenarios, introduces an interpretable and extensible Hierarchical Rulebook to encode objective priorities, and defines quantitative metrics grounded in formal specifications. Experiments show strong alignment between the framework's assessments and human driving judgments: the benchmark effectively exposes agents' behavioral deficiencies under conflicting priorities and discriminates sharply among autonomous driving systems.
📝 Abstract
Developing autonomous driving systems for complex traffic environments requires balancing multiple objectives, such as avoiding collisions, obeying traffic rules, and making efficient progress. In many situations, these objectives cannot be satisfied simultaneously, and explicit priority relations naturally arise. Moreover, driving rules are context-dependent, so it is important to formally model the environment scenarios in which they apply. Existing benchmarks for evaluating autonomous vehicles lack such a combination of multi-objective prioritized rules and formal environment models. In this work, we introduce ScenicRules, a benchmark for evaluating autonomous driving systems in stochastic environments under prioritized multi-objective specifications. We first formalize a diverse set of objectives to serve as quantitative evaluation metrics. Next, we design a Hierarchical Rulebook framework that encodes multiple objectives and their priority relations in an interpretable and adaptable manner. We then construct a compact yet representative collection of scenarios spanning diverse driving contexts and near-accident situations, formally modeled in the Scenic language. Experimental results show that our formalized objectives and Hierarchical Rulebooks align well with human driving judgments and that our benchmark effectively exposes agent failures with respect to the prioritized objectives. Our benchmark can be accessed at https://github.com/BerkeleyLearnVerify/ScenicRules/.
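To make the idea of prioritized multi-objective rules concrete, the sketch below shows one common way such a rulebook can be evaluated: rules are grouped into priority tiers, and two trajectories are compared lexicographically over per-tier violation scores, so a higher-priority objective (e.g. collision avoidance) always outranks a lower-priority one (e.g. progress). This is a minimal illustrative sketch, not the paper's actual implementation; all rule names, thresholds, and the dictionary-based trajectory representation are hypothetical.

```python
# Hypothetical sketch of a hierarchical rulebook comparison; NOT the
# benchmark's actual code. Each rule maps a trajectory to a violation
# score in [0, 1], where 0 means no violation.
from typing import Callable, Dict, List, Sequence

Trajectory = Dict[str, float]
Rule = Callable[[Trajectory], float]


def tier_scores(tiers: Sequence[Sequence[Rule]], traj: Trajectory) -> List[float]:
    """Aggregate violation scores per priority tier (summed within a tier)."""
    return [sum(rule(traj) for rule in tier) for tier in tiers]


def better(tiers: Sequence[Sequence[Rule]], a: Trajectory, b: Trajectory) -> bool:
    """True if `a` strictly dominates `b`: lower total violation at the
    highest-priority tier where the two trajectories differ."""
    return tier_scores(tiers, a) < tier_scores(tiers, b)


# Stand-in rules (names and thresholds are invented for illustration):
def collision(t: Trajectory) -> float:
    return 1.0 if t["min_gap"] < 0.5 else 0.0        # safety: keep clearance

def speeding(t: Trajectory) -> float:
    return max(0.0, (t["speed"] - 30.0) / 30.0)       # compliance: speed limit

def progress(t: Trajectory) -> float:
    return max(0.0, 1.0 - t["distance"] / 100.0)      # efficiency: distance made

# Tier 0 (safety) outranks tier 1 (compliance) outranks tier 2 (efficiency).
tiers = [[collision], [speeding], [progress]]

safe_slow = {"min_gap": 2.0, "speed": 25.0, "distance": 40.0}
fast_risky = {"min_gap": 0.3, "speed": 35.0, "distance": 90.0}

# The safe-but-slow trajectory wins: avoiding a collision outranks progress.
print(better(tiers, safe_slow, fast_risky))  # True
```

Because tier scores are plain lists compared with Python's built-in lexicographic ordering, adding a new objective or reordering priorities only changes the `tiers` structure, which is one way such a rulebook stays interpretable and extensible.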