Beyond Simulation: Benchmarking World Models for Planning and Causality in Autonomous Driving

📅 2025-08-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluating world models as surrogate environments for autonomous driving policy training remains challenging, as existing metrics overlook the causal influence of the ego vehicle on surrounding traffic—a critical factor for assessing robustness in realistic training scenarios. Method: We propose the first evaluation framework explicitly designed for causal agents and partial replay, grounded in the Waymo Open Sim-Agents Challenge. It introduces a metametric benchmark that quantifies model sensitivity to exogenous behaviors of uncontrollable traffic participants. Contribution/Results: Experiments reveal that several top-performing world models—deemed robust under standard evaluation—exhibit significant degradation when subjected to causal agent perturbations or partial replay. Our novel metrics effectively discriminate between models in terms of training-time robustness, offering a more policy-relevant assessment criterion than conventional metrics. This framework advances principled, causally grounded evaluation of world models for autonomous driving.

📝 Abstract
World models have become increasingly popular as learned traffic simulators, and recent work has explored replacing traditional traffic simulators with world models for policy training. In this work, we examine whether the metrics used to evaluate world models as traffic simulators are also suitable for evaluating a world model as a pseudo-environment for policy training. Specifically, we analyze the metametric employed by the Waymo Open Sim-Agents Challenge (WOSAC) and compare world model predictions on standard scenarios where the agents are fully or only partially controlled by the world model (partial replay). Furthermore, since we are interested in evaluating the ego action-conditioned world model, we extend the standard WOSAC evaluation domain to include agents that are causal to the ego vehicle. Our evaluations reveal a significant number of scenarios where top-ranking models perform well under no perturbation but fail when the ego agent is forced to replay its original trajectory. To address these cases, we propose new metrics that highlight the sensitivity of world models to uncontrollable objects, use them to evaluate world models as pseudo-environments for policy training, and analyze several state-of-the-art world models under these new metrics.
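The partial-replay comparison described in the abstract can be pictured as scoring the same scenario twice: once with the world model controlling all agents, and once with the ego agent forced to replay its logged trajectory. The sketch below illustrates the idea; `world_model.rollout`, its `replay_agents` parameter, and the displacement-based score are all hypothetical stand-ins, not the paper's actual WOSAC metametric or API.

```python
import numpy as np

def rollout_score(trajectories: np.ndarray, log_trajectories: np.ndarray) -> float:
    """Hypothetical stand-in for a WOSAC-style realism metric: the
    negative mean displacement between simulated and logged agent
    positions (higher is better). Arrays have shape
    (num_agents, num_steps, 2)."""
    return -float(np.mean(np.linalg.norm(trajectories - log_trajectories, axis=-1)))

def replay_sensitivity(world_model, scenario, log_trajectories) -> float:
    """Score the model with every agent under its control, then again
    with the ego agent replaying the logged trajectory (partial replay),
    and report the score degradation between the two settings."""
    full = world_model.rollout(scenario)                            # model controls all agents
    partial = world_model.rollout(scenario, replay_agents=["ego"])  # ego replays the log
    score_full = rollout_score(full, log_trajectories)
    score_partial = rollout_score(partial, log_trajectories)
    # A large gap suggests the model is sensitive to agents it cannot control.
    return score_full - score_partial
```

A model that looks strong under full control but degrades sharply under partial replay would receive a large sensitivity value, which is the kind of failure case the paper's new metrics are designed to surface.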
Problem

Research questions and friction points this paper is trying to address.

Evaluating world models as traffic simulators for policy training
Assessing robustness of metrics in ego action-conditioned scenarios
Proposing new metrics for world model sensitivity to perturbations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluating world models as pseudo-environments for policy training
Extending WOSAC metrics to include causal agent scenarios
Proposing new metrics for world model sensitivity analysis
Hunter Schofield
Noah's Ark Lab, Huawei Technologies Canada
Mohammed Elmahgiubi
Noah's Ark Lab, Huawei Technologies Canada
Kasra Rezaee
Unknown affiliation
Jinjun Shan
York University, Toronto, Canada