Is Your Driving World Model an All-Around Player?

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the disconnect between visual realism and physical or behavioral consistency in existing driving world models, as well as the absence of a unified evaluation framework. To bridge this gap, the authors introduce WorldLens, a unified benchmark for driving world models that combines algorithmic metrics with human perceptual judgments. It is organized into five complementary aspects, spanning pixel-level quality, 4D geometry, closed-loop driving performance, and alignment with human preferences, and covers 24 standardized dimensions in total. The authors further present WorldLens-26K, a dataset of 26,808 human preference annotations that pair numerical scores with textual rationales, and WorldLens-Agent, a vision-language evaluator distilled from these judgments for scalable, explainable automated assessment. An evaluation of six representative models shows that no single model dominates all dimensions; even the strongest performers score only 2–3 out of 10 on human realism ratings.

📝 Abstract

Today's driving world models can generate remarkably realistic dash-cam videos, yet no single model excels universally. Some generate photorealistic textures but violate basic physics; others maintain geometric consistency but fail when subjected to closed-loop planning. This disconnect exposes a critical gap: the field evaluates how real generated worlds appear, but rarely whether they behave realistically. We introduce WorldLens, a unified benchmark that measures world-model fidelity across the full spectrum, from pixel quality and 4D geometry to closed-loop driving and human perceptual alignment, through five complementary aspects and 24 standardized dimensions. Our evaluation of six representative models reveals that no existing approach dominates across all axes: texture-rich models violate geometry, geometry-aware models lack behavioral fidelity, and even the strongest performers achieve only 2-3 out of 10 on human realism ratings. To bridge algorithmic metrics with human perception, we further contribute WorldLens-26K, a 26,808-entry human-annotated preference dataset pairing numerical scores with textual rationales, and WorldLens-Agent, a vision-language evaluator distilled from these judgments that enables scalable, explainable auto-assessment. Together, the benchmark, dataset, and agent form a unified ecosystem for assessing generated worlds not merely by visual appeal, but by physical and behavioral fidelity.

Problem

Research questions and friction points this paper is trying to address.

driving world model

realism evaluation

physical fidelity

behavioral consistency

human perception

Innovation

Methods, ideas, or system contributions that make the work stand out.

world model evaluation

closed-loop driving

human perceptual alignment