AirScape: An Aerial Generative World Model with Motion Controllability

📅 2025-07-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the foundational problem of predicting 3D motion intentions in embodied intelligence by proposing the first generative world model for aerial environments supporting six-degree-of-freedom (6-DoF) motion controllability. Methodologically: (1) we construct a large-scale, manually annotated dataset comprising 11,000 first-person video–motion-intention pairs; (2) we introduce a two-stage training strategy that jointly incorporates physics-informed spatiotemporal constraints and vision–intention alignment to endow the model with spatial imagination capability. Our key contribution is the first demonstration of physically plausible, long-horizon dynamic observation sequence generation conditioned jointly on visual input and fine-grained 6-DoF motion commands. Extensive evaluation across diverse aerial scenarios confirms the model’s high-fidelity response to complex motion intentions, establishing a scalable world modeling paradigm for spatial reasoning and autonomous planning in embodied agents.

📝 Abstract
How to enable robots to predict the outcomes of their own motion intentions in three-dimensional space has been a fundamental problem in embodied intelligence. To explore more general spatial imagination capabilities, here we present AirScape, the first world model designed for six-degree-of-freedom aerial agents. AirScape predicts future observation sequences based on current visual inputs and motion intentions. Specifically, we construct a dataset for aerial world model training and testing, which consists of 11k video-intention pairs. This dataset includes first-person-view videos capturing diverse drone actions across a wide range of scenarios, with over 1,000 hours spent annotating the corresponding motion intentions. Then we develop a two-phase training schedule to train a foundation model -- initially devoid of embodied spatial knowledge -- into a world model that is controllable by motion intentions and adheres to physical spatio-temporal constraints.
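The abstract describes conditioning future-observation generation on a current frame plus a 6-DoF motion intention. As a minimal sketch of what such an interface might look like, the following Python illustrates a hypothetical `MotionIntent` command and an autoregressive rollout loop; all names and the toy dynamics are illustrative assumptions, not the paper's actual model.

```python
# Hypothetical sketch of a 6-DoF intent-conditioned rollout interface.
# MotionIntent, rollout, and the toy dynamics are illustrative assumptions;
# the real AirScape model generates video frames, not these placeholder grids.
from dataclasses import dataclass
from typing import List

@dataclass
class MotionIntent:
    """A 6-DoF motion command: translation (m) and rotation (rad)."""
    dx: float
    dy: float
    dz: float
    roll: float
    pitch: float
    yaw: float

def rollout(initial_frame: List[List[float]],
            intents: List[MotionIntent]) -> List[List[List[float]]]:
    """Autoregressive rollout stub: a real world model would generate each
    next observation conditioned on the previous one and the current intent."""
    frames = [initial_frame]
    for intent in intents:
        prev = frames[-1]
        # Placeholder dynamics: nudge intensities by the commanded forward motion.
        frames.append([[p + 0.01 * intent.dx for p in row] for row in prev])
    return frames

# Example: a three-step straight-ahead flight intention.
frame0 = [[0.0, 0.0], [0.0, 0.0]]
intents = [MotionIntent(1.0, 0.0, 0.0, 0.0, 0.0, 0.0)] * 3
seq = rollout(frame0, intents)
print(len(seq))  # initial frame + 3 predicted frames
```

The key design point the paper emphasizes is that the conditioning signal is a fine-grained 6-DoF command rather than a discrete action label, which is what the `MotionIntent` fields stand in for here.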
Problem

Research questions and friction points this paper is trying to address.

Enable robots to predict 3D motion outcomes
Develop the first aerial world model for 6-DoF agents
Train model with visual inputs and motion intentions
Innovation

Methods, ideas, or system contributions that make the work stand out.

First world model for six-degree-of-freedom aerial agents
Dataset with 11k video-intention pairs for training
Two-phase training schedule for motion-controllable model
Baining Zhao
Tsinghua University
Rongze Tang
Tsinghua University
Mingyuan Jia
Tsinghua University
Ziyou Wang
Tsinghua University
Fanghang Man
Tsinghua University
Xin Zhang
Tsinghua University
Yu Shang
Department of Electronic Engineering, Tsinghua University
Multimodal Learning · LLM Agent · Recommender System
Weichen Zhang
PhD, University of Sydney
Computer Vision · Deep Learning · Transfer Learning · Domain Adaptation
Chen Gao
Tsinghua University
Wei Wu
Tsinghua University
Xin Wang
Tsinghua University
Xinlei Chen
Tsinghua University
Yong Li
Tsinghua University