LoViF 2026 The First Challenge on Holistic Quality Assessment for 4D World Model (PhyScore)

📅 2026-05-06

🤖 AI Summary

Current evaluation metrics for video generation predominantly emphasize perceptual quality and fall short in assessing physical plausibility, temporal coherence, and alignment with conditioning inputs. To address this gap, this work presents the LoViF 2026 PhyScore challenge, which frames the evaluation of generative 4D world models along four dimensions: video quality, physical realism, condition-video alignment, and temporal consistency. It further introduces a fine-grained task of localizing the timestamps of physical anomalies. The benchmark dataset, labeled through trained human annotation with an additional automated quality-control pass, comprises 1,554 videos from seven generative world models, spanning 26 categories and three generation tracks: text-to-2D, image-to-4D, and video-to-4D. Evaluation uses a composite protocol that combines TimeStamp_IOU with SRCC/PLCC.
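SRCC and PLCC are the standard Spearman rank and Pearson linear correlation coefficients between predicted and human scores. The sketch below shows one way they could be computed per dimension in plain Python. The dimension keys, the function names, and the choice to return 0.0 for constant inputs are illustrative assumptions, not the challenge's official scoring code.

```python
from math import sqrt

# Hypothetical keys for the four scored dimensions (names are assumptions).
DIMENSIONS = ("video_quality", "physical_realism",
              "condition_alignment", "temporal_consistency")


def plcc(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat constant inputs as uncorrelated
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return plcc(average_ranks(x), average_ranks(y))


def score_correlations(pred, human):
    """Map each dimension to (SRCC, PLCC), given dicts of score lists."""
    return {d: (srcc(pred[d], human[d]), plcc(pred[d], human[d]))
            for d in DIMENSIONS}
```

SRCC rewards getting the ordering of videos right, while PLCC also rewards a linear relationship between predicted and human scores, which is why the two are commonly reported together.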

📝 Abstract

This paper reports on the LoViF 2026 PhyScore challenge, a competition on holistic quality assessment of world-model-generated videos across both 2D and 4D generation settings. The challenge is motivated by a central gap in current evaluation practice: perceptual quality alone is insufficient to judge whether generated dynamics are physically plausible, temporally coherent, and consistent with input conditions. Participants are required to build a metric that jointly predicts four dimensions, i.e., Video Quality, Physical Realism, Condition-Video Alignment, and Temporal Consistency. Beyond that, participants also need to localize physical anomaly timestamps for fine-grained diagnosis. The benchmark dataset contains 1,554 videos generated by seven representative generative world models, organized into three tracks (text-to-2D, image-to-4D, and video-to-4D) and spanning 26 categories. These categories explicitly cover physics-relevant scenarios, including dynamics, optics, and thermodynamics, together with diverse real-world and creative content. To ensure label reliability, scores and anomaly timestamps are produced through trained human annotation with an additional automated quality-control pass. Evaluation is based on both score prediction and anomaly localization, with a composite protocol that combines TimeStamp_IOU and SRCC/PLCC. This report summarizes the challenge design and provides method-level insights from submitted solutions.
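The abstract does not spell out how TimeStamp_IOU is defined. As a rough illustration only, the sketch below assumes anomalies are given as (start, end) intervals in seconds and computes a temporal intersection-over-union between predicted and annotated intervals. The interval representation, the function names, and the convention of returning 1.0 when both sides are empty are assumptions rather than the challenge's official definition.

```python
def merge_intervals(intervals):
    """Sort intervals and merge overlaps; drop empty or inverted ones."""
    merged = []
    for start, end in sorted(intervals):
        if end <= start:
            continue
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def total_length(intervals):
    """Summed duration of already-merged intervals."""
    return sum(end - start for start, end in intervals)


def timestamp_iou(pred, truth):
    """Temporal IoU between two sets of (start, end) anomaly intervals."""
    p, t = merge_intervals(pred), merge_intervals(truth)
    intersection = 0.0
    i = j = 0
    # Two-pointer sweep over both sorted, non-overlapping interval lists.
    while i < len(p) and j < len(t):
        lo = max(p[i][0], t[j][0])
        hi = min(p[i][1], t[j][1])
        if hi > lo:
            intersection += hi - lo
        # Advance whichever interval finishes first.
        if p[i][1] < t[j][1]:
            i += 1
        else:
            j += 1
    union = total_length(p) + total_length(t) - intersection
    if union == 0:
        return 1.0  # assumption: no anomalies predicted or annotated
    return intersection / union
```

For example, a prediction of 1.0 to 3.0 s against an annotation of 2.0 to 4.0 s overlaps for 1 s out of a 3 s union, giving an IoU of 1/3. How this value is weighted against SRCC/PLCC in the composite score is not stated here, so no combination is sketched.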

Problem

Research questions and friction points this paper is trying to address.

holistic quality assessment

physical realism

temporal consistency

condition-video alignment

4D world model

Innovation

Methods, ideas, or system contributions that make the work stand out.

holistic quality assessment

4D world model

physical realism