MotionScape: A Large-Scale Real-World Highly Dynamic UAV Video Dataset for World Models

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that existing world models struggle to maintain spatiotemporal physical consistency under highly dynamic drone viewpoints, primarily due to the absence of realistic 6-degree-of-freedom (6-DoF) motion priors in their training data. To bridge this gap, the authors introduce the first large-scale dataset of highly dynamic drone videos, comprising over 30 hours of 4K footage accompanied by precise 6-DoF camera trajectories and fine-grained language descriptions. They further propose an automated multi-stage processing pipeline that integrates CLIP-based relevance filtering, temporal segmentation, robust visual SLAM, and large language model–driven semantic annotation to achieve geometric and semantic alignment. Experiments demonstrate that this dataset substantially enhances world models' ability to simulate complex 3D dynamics and large viewpoint variations, thereby improving the decision-making and planning capabilities of drone agents in challenging environments.
📝 Abstract
Recent advances in world models have demonstrated strong capabilities in simulating physical reality, making them an increasingly important foundation for embodied intelligence. For UAV agents in particular, accurate prediction of complex 3D dynamics is essential for autonomous navigation and robust decision-making in unconstrained environments. However, under the highly dynamic camera trajectories typical of UAV views, existing world models often struggle to maintain spatiotemporal physical consistency. A key reason lies in the distribution bias of current training data: most existing datasets exhibit restricted 2.5D motion patterns, such as ground-constrained autonomous driving scenes or relatively smooth human-centric egocentric videos, and therefore lack realistic high-dynamic 6-DoF UAV motion priors. To address this gap, we present MotionScape, a large-scale real-world UAV-view video dataset with highly dynamic motion for world modeling. MotionScape contains over 30 hours of 4K UAV-view videos, totaling more than 4.5M frames. This novel dataset features semantically and geometrically aligned training samples, where diverse real-world UAV videos are tightly coupled with accurate 6-DoF camera trajectories and fine-grained natural language descriptions. To build the dataset, we develop an automated multi-stage processing pipeline that integrates CLIP-based relevance filtering, temporal segmentation, robust visual SLAM for trajectory recovery, and large-language-model-driven semantic annotation. Extensive experiments show that incorporating such semantically and geometrically aligned annotations effectively improves the ability of existing world models to simulate complex 3D dynamics and handle large viewpoint shifts, thereby benefiting decision-making and planning for UAV agents in complex environments. The dataset is publicly available at https://github.com/Thelegendzz/MotionScape
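The abstract describes the first two pipeline stages, CLIP-based relevance filtering followed by temporal segmentation, only at a high level. A minimal sketch of that filtering logic, assuming frame embeddings have already been produced by a CLIP image encoder and a query embedding by the matching text encoder (the function names, threshold, and gap/length parameters here are illustrative assumptions, not the authors' implementation), might look like:

```python
import numpy as np

def relevance_filter(frame_embeds, text_embed, threshold=0.25):
    """Keep indices of frames whose cosine similarity to the text
    query embedding meets the threshold (CLIP-style relevance)."""
    f = frame_embeds / np.linalg.norm(frame_embeds, axis=1, keepdims=True)
    t = text_embed / np.linalg.norm(text_embed)
    sims = f @ t  # cosine similarity per frame
    return np.nonzero(sims >= threshold)[0]

def segment_clips(kept, max_gap=1, min_len=8):
    """Group surviving frame indices into temporal clips: a new clip
    starts whenever the index gap exceeds max_gap; clips shorter than
    min_len frames are discarded."""
    if len(kept) == 0:
        return []
    clips = []
    start = prev = kept[0]
    for i in kept[1:]:
        if i - prev > max_gap:
            if prev - start + 1 >= min_len:
                clips.append((start, prev))
            start = i
        prev = i
    if prev - start + 1 >= min_len:
        clips.append((start, prev))
    return clips
```

The surviving clips would then feed the later stages the abstract lists (visual SLAM for 6-DoF trajectory recovery and LLM-driven captioning), which are not sketched here.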
Problem

Research questions and friction points this paper is trying to address.

world models
UAV
highly dynamic motion
6-DoF trajectories
dataset bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

world models
UAV dataset
6-DoF trajectory
semantic-geometric alignment
dynamic scene modeling
Zile Guo
Aerospace Information Research Institute, Chinese Academy of Sciences
Zhan Chen
Georgia Southern University
Mathematical modeling in biology and scientific computing
Enze Zhu
Aerospace Information Research Institute, Chinese Academy of Sciences
Kan Wei
Aerospace Information Research Institute, Chinese Academy of Sciences
Yongkang Zou
Jilin University
Xiaoxuan Liu
Aerospace Information Research Institute, Chinese Academy of Sciences
Lei Wang
ICT, CAS