🤖 AI Summary
Existing evaluation metrics for world models lack fine-grained characterization of action alignment and semantic consistency during dynamic rollouts.
Method: We propose UNIVERSE—the first unified vision-language modeling framework tailored for multimodal, time-sensitive evaluation of world models across diverse output formats. It formalizes action–semantic consistency assessment as three complementary tasks: binary classification, multiple-choice, and open-ended question answering. UNIVERSE employs full, partial, or parameter-efficient fine-tuning strategies to optimize multimodal reasoning under varying context lengths, sampling schemes, and data compositions.
Results: A single checkpoint of UNIVERSE matches or exceeds specialized baselines across multiple quantitative metrics. Human evaluation confirms high agreement between UNIVERSE’s judgments and expert annotations (Cohen’s κ = 0.87), demonstrating strong semantic awareness, robust generalizability, and scalability to diverse world model architectures and rollout configurations.
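The agreement figure above is Cohen's κ, which measures inter-rater agreement corrected for chance. As a reminder of what that statistic computes, here is a minimal self-contained implementation (illustrative only, not the paper's evaluation code):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items both raters label identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the two raters labeled independently,
    # each according to their own marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two raters agreeing on 3 of 4 binary judgments.
kappa = cohens_kappa(["yes", "yes", "no", "no"],
                     ["yes", "no", "no", "no"])  # -> 0.5
```

Values near 1.0 indicate near-perfect agreement beyond chance; κ = 0.87 is conventionally read as "almost perfect" agreement.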
📝 Abstract
World models -- generative models that simulate environment dynamics conditioned on past observations and actions -- are gaining prominence in planning, simulation, and embodied AI. However, evaluating their rollouts remains a fundamental challenge, requiring fine-grained, temporally grounded assessment of action alignment and semantic consistency -- capabilities not captured by existing metrics. Vision-Language Models (VLMs) have shown promise as automatic evaluators of generative content due to their strong multimodal reasoning abilities. Yet their use in fine-grained, temporally sensitive evaluation tasks remains limited and requires targeted adaptation. We introduce an evaluation protocol targeting two recognition tasks -- action recognition and character recognition -- each assessed across binary, multiple-choice, and open-ended formats. To support this, we present UNIVERSE (UNIfied Vision-language Evaluator for Rollouts in Simulated Environments), a method for adapting VLMs to rollout evaluation under data and compute constraints. We conduct a large-scale study comparing full, partial, and parameter-efficient fine-tuning across task formats, context lengths, sampling strategies, and data compositions. The resulting unified evaluator matches the performance of task-specific baselines using a single checkpoint. Human studies confirm strong alignment with human judgments, establishing UNIVERSE as a scalable, semantics-aware evaluator for world models.
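To make the protocol concrete, the two recognition tasks crossed with the three question formats can be represented as evaluation items like the sketch below. The schema and field names are assumptions for illustration; the paper does not specify its data layout.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RolloutEvalItem:
    """One evaluation query about a world-model rollout (hypothetical schema)."""
    task: str                     # "action_recognition" or "character_recognition"
    question_format: str          # "binary" | "multiple_choice" | "open_ended"
    question: str                 # prompt shown to the VLM with the rollout frames
    choices: Optional[List[str]]  # candidate answers; only for multiple-choice
    answer: str                   # gold label used to score the VLM's response

# One illustrative item per question format (content invented for the sketch).
items = [
    RolloutEvalItem("action_recognition", "binary",
                    "Does the agent jump in this clip?", None, "yes"),
    RolloutEvalItem("action_recognition", "multiple_choice",
                    "Which action is performed in this clip?",
                    ["jump", "crouch", "turn left", "attack"], "jump"),
    RolloutEvalItem("character_recognition", "open_ended",
                    "Which character appears in the rollout?", None, "knight"),
]
```

Binary and multiple-choice items can be scored by exact match against `answer`, while open-ended responses require a softer comparison, which is where an adapted VLM evaluator earns its keep.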