DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation

📅 2025-03-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses insufficient spatio-temporal consistency in video generation, particularly under complex camera motion, which leads to narrative discontinuity, object incoherence, and cross-view distortion. To this end, it proposes the first framework that jointly models dynamic camera motion and subject action. It introduces DropletVideo-10M, a large-scale dataset of 10 million videos with fine-grained captions averaging 206 words per video, and defines "integral spatio-temporal consistency" as a unified objective covering camera motion semantics, the evolution of object behavior, and long-range narrative dependencies. It further describes an architecture that co-models camera motion and action, trained with an end-to-end spatio-temporal alignment strategy. Experiments indicate that the DropletVideo model improves object persistence, shot-level logical coherence, and inter-frame narrative consistency, outperforming state-of-the-art methods on multiple quantitative metrics. The dataset and model are publicly available.

📝 Abstract
Spatio-temporal consistency is a critical research topic in video generation. A qualified generated video segment must ensure plot plausibility and coherence while maintaining visual consistency of objects and scenes across varying viewpoints. Prior research, especially in open-source projects, primarily focuses on either temporal or spatial consistency, or on a basic combination of the two, such as appending a description of a camera movement to a prompt without constraining the outcome of that movement. However, camera movement may introduce new objects into the scene or remove existing ones, thereby overlaying and affecting the preceding narrative. In videos with numerous camera movements especially, the interplay between multiple plots becomes increasingly complex. This paper introduces and examines integral spatio-temporal consistency, which considers the synergy between plot progression and camera techniques as well as the long-term impact of prior content on subsequent generation. Our research spans dataset construction through model development. First, we constructed the DropletVideo-10M dataset, which comprises 10 million videos featuring dynamic camera motion and object actions. Each video is annotated with a caption averaging 206 words that details the camera movements and plot developments. We then developed and trained the DropletVideo model, which excels at preserving spatio-temporal coherence during video generation. The DropletVideo dataset and model are accessible at https://dropletx.github.io.

Problem

Research questions and friction points this paper is trying to address.

Ensuring spatio-temporal consistency in video generation.
Handling the complex interplay between multiple plots and camera movements.
Developing a model for coherent video generation under dynamic camera motion.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integral spatio-temporal consistency in video generation
DropletVideo-10M dataset with dynamic camera motion
DropletVideo model for spatio-temporal coherence

Authors

Runze Zhang
IEIT System Co., Ltd.
Guoguang Du
Inspur (Beijing) Electronic Information Industry Co., Ltd.
3D Vision, Computer Graphics, Computer Vision
Xiaochuan Li
Carnegie Mellon University
Machine Learning, Natural Language Processing
Qi Jia
IEIT System Co., Ltd.
Liang Jin
IEIT System Co., Ltd.
Lu Liu
IEIT System Co., Ltd.
Jingjing Wang
Professor, School of Cyber Science and Technology, Beihang University
AI for Wireless, UAV Networks, Space-Air-Ground-Sea Networks, Communication Security
Cong Xu
IEIT System Co., Ltd.
Zhenhua Guo
IEIT System Co., Ltd.
Yaqian Zhao
IEIT System Co., Ltd.
Xiaoli Gong
Nankai University
Rengang Li
IEIT System Co., Ltd., Tsinghua University
Baoyu Fan
IEIT System Co., Ltd., Nankai University