WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World

📅 2025-12-11
🤖 AI Summary
Current generative world models suffer from geometric distortions, physical inconsistencies, and unreliable agent behaviors when constructing 4D driving environments, hindering their deployment in autonomous driving; moreover, no unified evaluation framework exists. This paper introduces WorldLens—the first comprehensive, autonomous-driving–oriented evaluation framework—assessing models across five dimensions: generation quality, geometric reconstruction, action following, downstream task performance, and human preference. We release WorldLens-26K, a benchmark dataset comprising 26K human-annotated driving videos. Additionally, we propose WorldLens-Agent, an interpretable and scalable evaluation model integrating multi-scale video quality assessment, joint geometric-physical constraints, preference learning, and knowledge distillation. Experiments uncover fundamental trade-offs among texture fidelity, geometric accuracy, and behavioral realism. WorldLens establishes a standardized benchmark and open-source toolchain, advancing world models toward trustworthiness, controllability, and behaviorally grounded realism.

📝 Abstract
Generative world models are reshaping embodied AI, enabling agents to synthesize realistic 4D driving environments that look convincing but often fail physically or behaviorally. Despite rapid progress, the field still lacks a unified way to assess whether generated worlds preserve geometry, obey physics, or support reliable control. We introduce WorldLens, a full-spectrum benchmark evaluating how well a model builds, understands, and behaves within its generated world. It spans five aspects -- Generation, Reconstruction, Action-Following, Downstream Task, and Human Preference -- jointly covering visual realism, geometric consistency, physical plausibility, and functional reliability. Across these dimensions, no existing world model excels universally: those with strong textures often violate physics, while geometry-stable ones lack behavioral fidelity. To align objective metrics with human judgment, we further construct WorldLens-26K, a large-scale dataset of human-annotated videos with numerical scores and textual rationales, and develop WorldLens-Agent, an evaluation model distilled from these annotations to enable scalable, explainable scoring. Together, the benchmark, dataset, and agent form a unified ecosystem for measuring world fidelity -- standardizing how future models are judged not only by how real they look, but by how real they behave.
Problem

Research questions and friction points this paper is trying to address.

Evaluates whether generated driving worlds are physically and behaviorally realistic.
Assesses world models jointly across geometry, physics, and functional reliability.
Aligns objective metrics with human judgment to enable scalable, explainable scoring.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces a full-spectrum benchmark for world model evaluation.
Creates a large-scale human-annotated dataset with scores and rationales.
Develops an agent model for scalable, explainable automated scoring.
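The five evaluation dimensions named in the abstract (Generation, Reconstruction, Action-Following, Downstream Task, Human Preference) suggest a simple per-dimension scoring interface. The sketch below is purely illustrative: the dimension names come from the abstract, but the `WorldLensScore` class, the weights, and the `aggregate` function are hypothetical and do not represent the paper's actual scoring scheme.

```python
from dataclasses import dataclass

# Five dimensions from the WorldLens abstract; the uniform weighting is
# a hypothetical choice, not the paper's method.
DIMENSIONS = ("generation", "reconstruction", "action_following",
              "downstream_task", "human_preference")

@dataclass
class WorldLensScore:
    """Per-dimension scores in [0, 1], e.g. as an evaluation agent might emit."""
    scores: dict  # dimension name -> score

    def aggregate(self, weights=None):
        """Weighted mean over the five dimensions (uniform by default)."""
        if weights is None:
            weights = {d: 1.0 for d in DIMENSIONS}
        total = sum(weights[d] for d in DIMENSIONS)
        return sum(self.scores[d] * weights[d] for d in DIMENSIONS) / total

# Example: a model with strong textures but weaker physics/behavior,
# mirroring the trade-off the paper reports.
score = WorldLensScore(scores={
    "generation": 0.9, "reconstruction": 0.6, "action_following": 0.5,
    "downstream_task": 0.55, "human_preference": 0.7,
})
print(round(score.aggregate(), 3))  # prints 0.65
```

A real deployment would replace the uniform weights with weights fit to the human-preference annotations, which is the alignment role WorldLens-26K plays in the paper.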