RISE-Video: Can Video Generators Decode Implicit World Rules?

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a notable limitation of current generative video models: their weak ability to understand and reason about implicit world rules, such as commonsense knowledge, spatial dynamics, and domain-specific constraints, and the absence of dedicated evaluation frameworks for these capabilities. To bridge this gap, we propose the first benchmark for implicit-rule reasoning in Text-Image-to-Video (TI2V) generation, introducing a four-dimensional evaluation protocol that assesses reasoning alignment, temporal consistency, physical rationality, and visual quality. We further develop an automated evaluation pipeline powered by Large Multimodal Models (LMMs). Comprehensive experiments across 11 state-of-the-art TI2V models reveal widespread deficiencies in handling complex implicit constraints, offering critical insights and a foundational tool for advancing embodied intelligence and world-simulation research.

📝 Abstract
While generative video models have achieved remarkable visual fidelity, their capacity to internalize and reason over implicit world rules remains a critical yet under-explored frontier. To bridge this gap, we present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis that shifts the evaluative focus from surface-level aesthetics to deep cognitive reasoning. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories, providing a structured testbed for probing model intelligence across diverse dimensions, ranging from commonsense and spatial dynamics to specialized subject domains. Our framework introduces a multi-dimensional evaluation protocol consisting of four metrics: Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality. To further support scalable evaluation, we propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment. Extensive experiments on 11 state-of-the-art TI2V models reveal pervasive deficiencies in simulating complex scenarios under implicit constraints, offering critical insights for the advancement of future world-simulating generative models.
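For a concrete picture of how a four-dimension, LMM-as-judge protocol of this kind can be wired up, here is a minimal Python sketch. It is illustrative only: the `query_lmm` function, the prompt wording, and the 1–5 scoring scale are hypothetical placeholders, not the authors' actual implementation.

```python
# Minimal sketch of a four-dimension TI2V evaluation loop in the spirit of
# RISE-Video. All names (query_lmm, the questions, the 1-5 scale) are
# hypothetical illustrations, not the paper's actual pipeline.
from dataclasses import dataclass
from statistics import mean

# One judging question per evaluation dimension (hypothetical phrasing).
DIMENSIONS = {
    "reasoning_alignment": "Does the video follow the implicit rule in the prompt?",
    "temporal_consistency": "Are objects and motion coherent across frames?",
    "physical_rationality": "Do the depicted dynamics obey physical law?",
    "visual_quality": "Is the video sharp, artifact-free, and well composed?",
}

@dataclass
class Sample:
    prompt: str      # text instruction encoding an implicit world rule
    image_path: str  # conditioning first frame
    video_path: str  # model-generated video to be judged

def query_lmm(question: str, sample: Sample) -> int:
    """Placeholder for a Large Multimodal Model judge.

    A real pipeline would send sampled video frames plus the question to an
    LMM and parse a 1-5 score from its reply; here we return a dummy value.
    """
    return 3  # stand-in score

def evaluate(sample: Sample) -> dict[str, float]:
    # Query the judge once per dimension, then average into an overall score.
    scores = {dim: float(query_lmm(q, sample)) for dim, q in DIMENSIONS.items()}
    scores["overall"] = mean(scores.values())
    return scores

if __name__ == "__main__":
    s = Sample("An ice cube is left in the sun; show what happens.",
               "ice.png", "gen.mp4")
    print(evaluate(s))
```

Averaging per-dimension judge scores is only one plausible aggregation; a real benchmark might weight dimensions differently or report them separately, as RISE-Video's per-metric protocol suggests.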
Problem

Research questions and friction points this paper is trying to address.

implicit world rules
video generation
reasoning
text-to-video
cognitive evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

reasoning-oriented benchmark
implicit world rules
Text-Image-to-Video synthesis
multimodal evaluation
Large Multimodal Models
👥 Authors
Mingxin Liu
Shanghai Jiao Tong University
Shuran Ma
Xidian University
Shibei Meng
Beijing Normal University
Xiangyu Zhao
Shanghai Jiao Tong University
Medical Image Computing, Image Segmentation, Computer Vision
Zicheng Zhang
Shanghai Jiao Tong University
Shaofeng Zhang
Shanghai Jiao Tong University
Machine Learning
Zhihang Zhong
Researcher, Shanghai AI Laboratory
Computer Vision, Deep Learning
Pei-Pei Chen
Tencent Youtu Lab
Haoyu Cao
Tencent Youtu Lab
Xing Sun
Tencent Youtu Lab
LLM, MLLM, Agent
Haodong Duan
Shanghai AI Lab | CUHK | PKU
Computer Vision, Video Understanding, Multimodal Learning, Generative AI
Xue Yang
Shanghai Jiao Tong University