Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video generation models struggle to produce physically plausible aerial flight videos from low-level inertial signals such as acceleration and angular velocity, which limits their use in embodied AI. This work proposes Aero-World, a method that converts a pretrained image-to-video diffusion model into a controllable aerial video generator by injecting IMU signals (translational acceleration and angular velocity) into a pretrained latent diffusion transformer through an action-token stream. During LoRA finetuning, a frozen, differentiable latent-space Physics Probe provides inertial-consistency supervision without computationally expensive video decoding. The authors also introduce AeroBench, a benchmark that scores generated drone videos with Action Alignment Score (AAS) and Physical Consistency Rate (PCR). On AeroBench, Aero-World raises mean AAS from 57.7 to 63.6 over action-only finetuning and, compared with AirScape, achieves lower FVD (596.5 vs. 1058.6), higher SSIM (0.595 vs. 0.505), and higher Flow-IMU correlation (0.44 vs. 0.20).
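
The summary reports a Flow-IMU correlation but does not define it. One plausible reading, offered here only as an assumption, is a Pearson correlation between per-frame optical-flow magnitude and per-frame angular speed from the gyroscope. The minimal sketch below illustrates that reading; the function names (`pearson`, `gyro_magnitude`, `flow_imu_correlation`) and the toy data are hypothetical and are not taken from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant signal carries no correlation information
    return cov / math.sqrt(vx * vy)

def gyro_magnitude(gyro):
    """Per-frame angular speed from (wx, wy, wz) gyroscope samples."""
    return [math.sqrt(wx * wx + wy * wy + wz * wz) for wx, wy, wz in gyro]

def flow_imu_correlation(flow_mag, gyro):
    """ASSUMED metric: correlate mean optical-flow magnitude with angular speed."""
    return pearson(flow_mag, gyro_magnitude(gyro))

# Hypothetical toy data: flow magnitude rises and falls with the yaw rate.
gyro = [(0.0, 0.0, 0.1), (0.0, 0.0, 0.2), (0.0, 0.0, 0.4), (0.0, 0.0, 0.3)]
flow_mag = [1.0, 2.0, 4.0, 3.0]
score = flow_imu_correlation(flow_mag, gyro)
```

Under this reading, a generated video whose apparent motion tracks the commanded rotation scores near 1, while motion unrelated to the IMU signal scores near 0. The paper's actual definition may differ, for example in how flow is aggregated per frame.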

📝 Abstract

Foundation video models produce visually impressive results, but their use in embodied AI remains limited because they are primarily trained on natural language rather than low-level control signals. This limitation is especially pronounced for aerial flight, where motion occurs in unconstrained 6-DoF space and small errors in ego-motion can produce large trajectory drift. Generating aerial videos that follow fine-grained inertial actions can support scalable training and evaluation of aerial agents by providing a controllable proxy for real-world or expensive simulation data. To address this problem, we propose Aero-World, a method for converting a pretrained image-to-video diffusion model into a controllable aerial video generator. Aero-World injects sequences of translational acceleration and angular velocity into a pretrained latent diffusion transformer through an action-token stream. A frozen latent-space Physics Probe, trained independently on real video-IMU pairs, provides differentiable inertial-consistency supervision during LoRA finetuning while avoiding computationally expensive video decoding. We further propose AeroBench, a benchmark for evaluating whether generated drone videos adhere to low-level action signals. AeroBench uses Action Alignment Score (AAS) to measure agreement with commanded inertial actions and Physical Consistency Rate (PCR) to measure temporal motion stability. On AeroBench, Aero-World improves mean AAS from 57.7 to 63.6 over action-only finetuning and gives a stronger quality-control trade-off than AirScape, with lower FVD (596.5 vs. 1058.6), higher SSIM (0.595 vs. 0.505), and higher Flow-IMU correlation (0.44 vs. 0.20). These results suggest that frozen Physics Probe supervision is a practical mechanism for adapting pretrained video generators toward more action-aligned aerial motion.
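
The idea of supervision from a frozen but differentiable probe can be illustrated with a toy example. The sketch below is not the authors' implementation. It assumes a linear probe `W` that maps a latent `z` to a predicted IMU vector, and it applies gradient steps to the latent directly as a stand-in for the LoRA parameters that Aero-World actually updates. The diffusion loss is omitted, and all names and values are hypothetical.

```python
def matvec(W, z):
    """Multiply matrix W (list of rows) by vector z."""
    return [sum(w * x for w, x in zip(row, z)) for row in W]

def probe_loss_and_grad(W, z, imu):
    """Squared error between the probe's IMU prediction and the commanded IMU,
    plus its gradient with respect to the latent z. W is read, never modified."""
    residual = [p - a for p, a in zip(matvec(W, z), imu)]
    loss = sum(r * r for r in residual)
    # d/dz ||W z - a||^2 = 2 W^T (W z - a)
    grad = [2.0 * sum(W[i][j] * residual[i] for i in range(len(W)))
            for j in range(len(z))]
    return loss, grad

# Hypothetical toy setup: a 2-D "latent" and a 2-D commanded IMU signal.
W = [[1.0, 0.0], [0.0, 2.0]]  # frozen probe weights (stand-in for a trained network)
imu = [1.0, 4.0]              # commanded inertial signal
z = [0.0, 0.0]                # latent produced by the generator

W_before = [row[:] for row in W]
losses = []
for _ in range(50):
    loss, grad = probe_loss_and_grad(W, z, imu)
    losses.append(loss)
    # Only the generator side (here: the latent) is updated; the probe stays frozen.
    z = [zi - 0.1 * gi for zi, gi in zip(z, grad)]
```

The point of the sketch is that gradients pass through the probe to whatever produced the latent, while the probe's own weights stay fixed. Because the probe operates in latent space, no video frames need to be decoded to compute the loss, which is the efficiency argument the abstract makes.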

Problem

Research questions and friction points this paper is trying to address.

aerial video generation

inertial controls

embodied AI

6-DoF motion

action-conditioned generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

action-conditioned video generation

inertial control

latent diffusion transformer