🤖 AI Summary
Evaluating world models as surrogate environments for autonomous driving policy training remains challenging, as existing metrics overlook the causal influence of the ego vehicle on surrounding traffic—a critical factor for assessing robustness in realistic training scenarios.
Method: We propose the first evaluation framework explicitly designed for causal agents and partial replay, grounded in the Waymo Open Sim-Agents Challenge (WOSAC). Building on the WOSAC metametric, it introduces new metrics that quantify a model's sensitivity to the behaviors of uncontrollable traffic participants.
Contribution/Results: Experiments reveal that several top-performing world models, deemed robust under standard evaluation, degrade significantly under causal-agent perturbations or partial replay. The proposed metrics discriminate between models in terms of training-time robustness, offering a more policy-relevant assessment criterion than conventional metrics and advancing principled, causally grounded evaluation of world models for autonomous driving.
📝 Abstract
World models have become increasingly popular as learned traffic simulators, and recent work has explored replacing traditional traffic simulators with world models for policy training. In this work, we study whether the metrics used to evaluate world models as traffic simulators are also suitable for evaluating a world model as a pseudo-environment for policy training. Specifically, we analyze the metametric employed by the Waymo Open Sim-Agents Challenge (WOSAC) and compare world model predictions on standard scenarios where the agents are fully or only partially controlled by the world model (partial replay). Furthermore, since we are interested in evaluating ego-action-conditioned world models, we extend the standard WOSAC evaluation domain to include agents that are causal to the ego vehicle. Our evaluations reveal a significant number of scenarios in which top-ranking models perform well under no perturbation but fail when the ego agent is forced to replay its original trajectory. To address these cases, we propose new metrics that highlight the sensitivity of world models to uncontrollable objects, evaluate the suitability of world models as pseudo-environments for policy training, and analyze several state-of-the-art world models under these new metrics.
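The core comparison in the abstract, rolling out a world model once with all agents under model control and once with the ego forced to replay its logged trajectory, then measuring the metric gap, can be sketched as a toy example. Everything below is hypothetical scaffolding: the `Scenario` container, the linear "world model", and the displacement-error metric are illustrative stand-ins, not the WOSAC API or the paper's actual metrics.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Scenario:
    """Minimal stand-in for a logged driving scenario (hypothetical)."""
    ego_log: List[float]           # logged ego positions (1-D for simplicity)
    agent_logs: List[List[float]]  # logged positions of the other agents

def rollout(scenario: Scenario, replay_ego: bool) -> List[List[float]]:
    """Toy 'world model': each agent moves toward the ego's position.

    With replay_ego=True the ego is pinned to its logged trajectory
    (partial replay); otherwise the model also predicts the ego, here
    as a simple perturbation of the log.
    """
    ego = scenario.ego_log if replay_ego else [p * 1.1 for p in scenario.ego_log]
    return [[0.5 * (a + e) for a, e in zip(log, ego)]
            for log in scenario.agent_logs]

def metric(preds: List[List[float]], logs: List[List[float]]) -> float:
    """Average displacement error between predicted and logged agents."""
    errs = [abs(p - l)
            for pred, log in zip(preds, logs)
            for p, l in zip(pred, log)]
    return sum(errs) / len(errs)

def replay_sensitivity(scenario: Scenario) -> float:
    """Gap between the full-control and partial-replay metric values.

    A large gap flags a model that looks good under standard evaluation
    but degrades when the ego is replayed, the failure mode the new
    metrics are designed to expose.
    """
    full = metric(rollout(scenario, replay_ego=False), scenario.agent_logs)
    partial = metric(rollout(scenario, replay_ego=True), scenario.agent_logs)
    return abs(full - partial)
```

In this sketch a real evaluation would replace `rollout` with sampled world-model trajectories and `metric` with the WOSAC metametric components; the point is only the structure of the comparison, not the numbers it produces.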