🤖 AI Summary
This work addresses the disconnect between visual realism and physical or behavioral consistency in existing driving world models, as well as the absence of a unified evaluation framework. To bridge this gap, we introduce WorldLens, a unified benchmark for driving world models that combines algorithmic metrics with human perceptual judgments across five complementary aspects, ranging from pixel-level quality and 4D geometry to closed-loop driving performance and alignment with human preferences, and covering 24 standardized dimensions. We further present WorldLens-26K, a dataset of 26,808 human preference annotations that pair numerical scores with textual rationales, and develop WorldLens-Agent, an interpretable vision-language evaluator distilled from these judgments for automated assessment. Evaluations of six representative models show that no single model dominates all dimensions; notably, even the strongest performers score only 2–3 out of 10 on human realism ratings, underscoring significant limitations in current approaches.
📝 Abstract
Today's driving world models can generate remarkably realistic dash-cam videos, yet no single model excels universally. Some generate photorealistic textures but violate basic physics; others maintain geometric consistency but fail when subjected to closed-loop planning. This disconnect exposes a critical gap: the field evaluates how real generated worlds appear, but rarely whether they behave realistically. We introduce WorldLens, a unified benchmark that measures world-model fidelity across the full spectrum, from pixel quality and 4D geometry to closed-loop driving and human perceptual alignment, through five complementary aspects and 24 standardized dimensions. Our evaluation of six representative models reveals that no existing approach dominates across all axes: texture-rich models violate geometry, geometry-aware models lack behavioral fidelity, and even the strongest performers achieve only 2-3 out of 10 on human realism ratings. To bridge algorithmic metrics with human perception, we further contribute WorldLens-26K, a 26,808-entry human-annotated preference dataset pairing numerical scores with textual rationales, and WorldLens-Agent, a vision-language evaluator distilled from these judgments that enables scalable, explainable auto-assessment. Together, the benchmark, dataset, and agent form a unified ecosystem for assessing generated worlds not merely by visual appeal, but by physical and behavioral fidelity.