AirScape: An Aerial Generative World Model with Motion Controllability

📅 2025-07-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the foundational problem of predicting 3D motion intentions in embodied intelligence by proposing the first generative world model for aerial environments supporting six-degree-of-freedom (6-DoF) motion controllability. Methodologically: (1) we construct a large-scale, manually annotated dataset comprising 11,000 first-person video–motion-intention pairs; (2) we introduce a two-stage training strategy that jointly incorporates physics-informed spatiotemporal constraints and vision–intention alignment to endow the model with spatial imagination capability. Our key contribution is the first demonstration of physically plausible, long-horizon dynamic observation sequence generation conditioned jointly on visual input and fine-grained 6-DoF motion commands. Extensive evaluation across diverse aerial scenarios confirms the model’s high-fidelity response to complex motion intentions, establishing a scalable world modeling paradigm for spatial reasoning and autonomous planning in embodied agents.

📝 Abstract
How to enable robots to predict the outcomes of their own motion intentions in three-dimensional space has been a fundamental problem in embodied intelligence. To explore more general spatial imagination capabilities, here we present AirScape, the first world model designed for six-degree-of-freedom aerial agents. AirScape predicts future observation sequences based on current visual inputs and motion intentions. Specifically, we construct a dataset for aerial world model training and testing, which consists of 11k video-intention pairs. This dataset includes first-person-view videos capturing diverse drone actions across a wide range of scenarios, with over 1,000 hours spent annotating the corresponding motion intentions. Then we develop a two-phase training schedule to train a foundation model -- initially devoid of embodied spatial knowledge -- into a world model that is controllable by motion intentions and adheres to physical spatio-temporal constraints.
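The abstract describes conditioning future-observation generation on a current frame plus a 6-DoF motion intention. As a minimal sketch of what such an interface might look like, the following Python illustrates a hypothetical `MotionIntent` command and an autoregressive rollout loop; all names and the toy dynamics are illustrative assumptions, not the paper's actual model.

```python
# Hypothetical sketch of a 6-DoF intent-conditioned rollout interface.
# MotionIntent, rollout, and the toy dynamics are illustrative assumptions;
# the real AirScape model generates video frames, not these placeholder grids.
from dataclasses import dataclass
from typing import List

@dataclass
class MotionIntent:
    """A 6-DoF motion command: translation (m) and rotation (rad)."""
    dx: float
    dy: float
    dz: float
    roll: float
    pitch: float
    yaw: float

def rollout(initial_frame: List[List[float]],
            intents: List[MotionIntent]) -> List[List[List[float]]]:
    """Autoregressive rollout stub: a real world model would generate each
    next observation conditioned on the previous one and the current intent."""
    frames = [initial_frame]
    for intent in intents:
        prev = frames[-1]
        # Placeholder dynamics: nudge intensities by the commanded forward motion.
        frames.append([[p + 0.01 * intent.dx for p in row] for row in prev])
    return frames

# Example: a three-step straight-ahead flight intention.
frame0 = [[0.0, 0.0], [0.0, 0.0]]
intents = [MotionIntent(1.0, 0.0, 0.0, 0.0, 0.0, 0.0)] * 3
seq = rollout(frame0, intents)
print(len(seq))  # initial frame + 3 predicted frames
```

The key design point the paper emphasizes is that the conditioning signal is a fine-grained 6-DoF command rather than a discrete action label, which is what the `MotionIntent` fields stand in for here.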
Problem

Research questions and friction points this paper is trying to address.

Enable robots to predict 3D motion outcomes
Develop the first aerial world model for 6-DoF agents
Train model with visual inputs and motion intentions
Innovation

Methods, ideas, or system contributions that make the work stand out.

First world model for six-degree-of-freedom aerial agents
Dataset with 11k video-intention pairs for training
Two-phase training schedule for motion-controllable model
Baining Zhao
Tsinghua University
Rongze Tang
Tsinghua University
Mingyuan Jia
Tsinghua University
Ziyou Wang
Tsinghua University
Fanghang Man
Tsinghua University
Xin Zhang
Tsinghua University
Yu Shang
Department of Electronic Engineering, Tsinghua University
Multimodal Learning · LLM Agent · Recommender System
Weichen Zhang
PhD, University of Sydney
Computer Vision · Deep Learning · Transfer Learning · Domain Adaptation
Chen Gao
Tsinghua University
Wei Wu
Tsinghua University
Xin Wang
Tsinghua University
Xinlei Chen
Tsinghua University
Yong Li
Tsinghua University