UI2V-Bench: An Understanding-based Image-to-video Generation Benchmark

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing I2V evaluation benchmarks prioritize video quality and temporal coherence while neglecting models’ semantic understanding of input images and the physical plausibility and commonsense consistency of generated videos. To address this gap, we propose UI2V-Bench—the first I2V benchmark explicitly designed to assess semantic understanding and causal reasoning capabilities, structured along four dimensions: spatial understanding, attribute binding, category understanding, and reasoning. We introduce a novel dual-path evaluation framework powered by multimodal large language models (MLLMs): (1) instance-level fine-grained semantic analysis and (2) feedback-driven causal reasoning, jointly modeling semantic consistency, physical plausibility, and temporal logic. The benchmark comprises ~500 high-quality text–image pairs and systematically evaluates leading open-source and closed-source I2V models. MLLM-based automatic assessment achieves strong agreement with human judgments (Spearman’s ρ > 0.92), validating the benchmark’s effectiveness and reliability.
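The reported agreement between the MLLM-based metric and human judgments is a Spearman rank correlation, which can be computed without any external library. A minimal pure-Python sketch; the per-model scores below are illustrative placeholders, not data from the paper:

```python
def spearman_rho(x, y):
    """Spearman's rank correlation: the Pearson correlation of the ranks.
    Tied values receive the average of their tied rank positions."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            j = i
            # extend j over a group of tied values
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average 1-based rank across the tie group
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-model scores: automatic MLLM metric vs. mean human rating.
mllm_scores = [0.81, 0.62, 0.74, 0.55, 0.90]
human_scores = [0.78, 0.70, 0.60, 0.58, 0.88]
rho = spearman_rho(mllm_scores, human_scores)  # 0.9 for these illustrative scores
```

Because Spearman's ρ compares rankings rather than raw values, it rewards the automatic metric for ordering models the same way humans do, even if the two scales are calibrated differently.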

📝 Abstract
Generative diffusion models are developing rapidly and attracting increasing attention due to their wide range of applications. Image-to-Video (I2V) generation has become a major focus in the field of video synthesis. However, existing evaluation benchmarks primarily focus on aspects such as video quality and temporal consistency, while largely overlooking the model's ability to understand the semantics of specific subjects in the input image or to ensure that the generated video aligns with physical laws and human commonsense. To address this gap, we propose UI2V-Bench, a novel benchmark for evaluating I2V models with a focus on semantic understanding and reasoning. It introduces four primary evaluation dimensions: spatial understanding, attribute binding, category understanding, and reasoning. To assess these dimensions, we design two evaluation methods based on Multimodal Large Language Models (MLLMs): an instance-level pipeline for fine-grained semantic understanding, and a feedback-based reasoning pipeline that enables step-by-step causal assessment for more accurate evaluation. UI2V-Bench includes approximately 500 carefully constructed text-image pairs and evaluates a range of both open-source and closed-source I2V models across all defined dimensions. We further incorporate human evaluations, which show strong alignment with the proposed MLLM-based metrics. Overall, UI2V-Bench fills a critical gap in I2V evaluation by emphasizing semantic comprehension and reasoning ability, offering a robust framework and dataset to support future research and model development in the field.
Problem

Research questions and friction points this paper is trying to address.

Evaluates semantic understanding in image-to-video generation models
Assesses reasoning ability for physical laws and commonsense in videos
Addresses gaps in existing benchmarks that focus only on visual quality and temporal consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposed UI2V-Bench benchmark for semantic understanding evaluation
Designed MLLM-based pipelines for fine-grained and causal assessment
Incorporated human evaluations aligned with MLLM metrics
Ailing Zhang
Peking University
Lina Lei
Huawei Noah’s Ark Lab
Dehong Kong
Huawei Noah’s Ark Lab
Zhixin Wang
Zhejiang University
Jiaqi Xu
Huawei Noah’s Ark Lab
Fenglong Song
Huawei Noah’s Ark Lab
Chun-Le Guo
Nankai University
Chang Liu
Tsinghua University
Fan Li
Huawei Noah’s Ark Lab
Jie Chen
Peking University