Is Your Driving World Model an All-Around Player?

📅 2026-05-11
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the disconnect between visual realism and physical or behavioral consistency in existing driving world models, as well as the absence of a unified evaluation framework. To bridge this gap, the paper introduces WorldLens, a unified benchmark for driving world models that combines algorithmic metrics with human perceptual judgments. It covers five complementary aspects and 24 standardized dimensions, spanning pixel-level quality, 4D geometry, closed-loop driving performance, and alignment with human preferences. The paper further presents WorldLens-26K, a dataset of 26,808 human-annotated preference entries that pair numerical scores with textual rationales, and WorldLens-Agent, a vision-language evaluator distilled from these judgments for scalable, explainable automated assessment. Evaluations of six representative models show that no single model dominates across all axes: texture-rich models violate geometry, geometry-aware models lack behavioral fidelity, and even the strongest performers achieve only 2-3 out of 10 on human realism ratings.
📝 Abstract
Today's driving world models can generate remarkably realistic dash-cam videos, yet no single model excels universally. Some generate photorealistic textures but violate basic physics; others maintain geometric consistency but fail when subjected to closed-loop planning. This disconnect exposes a critical gap: the field evaluates how real generated worlds appear, but rarely whether they behave realistically. We introduce WorldLens, a unified benchmark that measures world-model fidelity across the full spectrum, from pixel quality and 4D geometry to closed-loop driving and human perceptual alignment, through five complementary aspects and 24 standardized dimensions. Our evaluation of six representative models reveals that no existing approach dominates across all axes: texture-rich models violate geometry, geometry-aware models lack behavioral fidelity, and even the strongest performers achieve only 2-3 out of 10 on human realism ratings. To bridge algorithmic metrics with human perception, we further contribute WorldLens-26K, a 26,808-entry human-annotated preference dataset pairing numerical scores with textual rationales, and WorldLens-Agent, a vision-language evaluator distilled from these judgments that enables scalable, explainable auto-assessment. Together, the benchmark, dataset, and agent form a unified ecosystem for assessing generated worlds not merely by visual appeal, but by physical and behavioral fidelity.
Problem

Research questions and friction points this paper is trying to address.

driving world model
realism evaluation
physical fidelity
behavioral consistency
human perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

world model evaluation
closed-loop driving
human perceptual alignment
vision-language evaluator
behavioral fidelity