WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of vision-and-language navigation in aerial environments, where an agent must execute natural language instructions to navigate through 3D spaces via closed-loop perception and action. We propose the first autoregressive world-action model for this task, formulating navigation as a prediction-driven world-action problem: leveraging a latent autoregressive video backbone and an instruction-conditioned dynamics prior, the model predicts short-horizon world-state transitions and directly decodes them into executable waypoint actions. A two-stage training framework is introduced alongside an action-aware GRPO reinforcement learning algorithm. Evaluated on public indoor and outdoor benchmarks, our approach improves success rates by over 12%, demonstrates pronounced advantages in complex scenarios, and achieves zero-shot deployment on real-world drones.
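The summary mentions an action-aware GRPO algorithm without describing it. As background only, the sketch below shows the group-relative advantage normalization used by standard GRPO: several rollouts are sampled for the same instruction, and each rollout's reward is standardized against the group. It is not the paper's action-aware variant. The function name, the use of population standard deviation, and the `eps` term are illustrative assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize each rollout's reward against its sampling group.

    Generic GRPO-style normalization (an assumption for illustration);
    the action-aware details of the paper's method are not reproduced here.
    """
    mu = mean(rewards)        # group mean reward
    sigma = pstdev(rewards)   # group (population) standard deviation
    # eps avoids division by zero when every rollout gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four rollouts: two reach the goal, two fail.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Successful rollouts receive positive advantages and failed ones negative, so the policy update favors waypoint decisions that led to better outcomes within the group.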

📝 Abstract

Aerial vision-language navigation (VLN) requires agents to follow natural-language instructions through closed-loop perception and action in 3D environments. We argue that aerial VLN can be formulated as a prediction-driven world-action problem: the agent should anticipate latent world evolution and act according to the predicted consequences. To this end, we propose WorldVLN, the first autoregressive world action model (WAM) for aerial VLN. Unlike full-sequence video-generation world models that generate an entire visual clip, WorldVLN adapts a latent autoregressive video backbone to predict short-horizon world-state transitions and directly decodes them into executable waypoint actions. After each action segment is executed, newly received observations are encoded back into the autoregressive context, enabling closed-loop world-action prediction. We further introduce a two-stage training framework that first grounds the video prior in instruction-conditioned navigation dynamics and then applies Action-aware GRPO, the first reinforcement learning method tailored to autoregressive WAMs, to optimize waypoint decisions through their downstream rollout consequences. On public outdoor and indoor benchmarks, WorldVLN consistently outperforms existing Vision-Language-Action baselines, with success-rate gains of more than 12% and larger advantages on challenging cases. It further transfers zero-shot to real drone deployment, suggesting that WorldVLN offers a promising route for spatial action tasks. Demos and code are available at https://embodiedcity.github.io/WorldVLN/.
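The closed loop described in the abstract (encode the observation, predict a short-horizon transition, decode waypoints, execute, re-encode) can be sketched as a minimal control loop. This is an illustrative skeleton, not the released WorldVLN code: `ClosedLoopNavigator`, `run_episode`, and the `encode`, `predict`, `decode`, `reset`, and `execute` callables are all hypothetical names standing in for the model components and the simulator or drone interface.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

Waypoint = Tuple[float, float, float]  # assumed (x, y, z) target position


@dataclass
class ClosedLoopNavigator:
    # All three callables are hypothetical stand-ins, not WorldVLN's API.
    encode: Callable[[object], object]                   # observation -> latent
    predict: Callable[[str, Sequence[object]], object]   # (instruction, context) -> predicted transition
    decode: Callable[[object], List[Waypoint]]           # predicted transition -> waypoint segment
    context: List[object] = field(default_factory=list)  # autoregressive context

    def step(self, instruction: str, observation: object) -> List[Waypoint]:
        # Encode the newest observation back into the autoregressive context.
        self.context.append(self.encode(observation))
        # Predict a short-horizon world-state transition from the context.
        transition = self.predict(instruction, self.context)
        # Decode the predicted transition into executable waypoints.
        return self.decode(transition)


def run_episode(
    nav: ClosedLoopNavigator,
    instruction: str,
    reset: Callable[[], object],
    execute: Callable[[List[Waypoint]], object],
    max_segments: int = 10,
) -> List[Waypoint]:
    """Alternate prediction and execution until no waypoints are produced."""
    observation = reset()
    trajectory: List[Waypoint] = []
    for _ in range(max_segments):
        waypoints = nav.step(instruction, observation)
        if not waypoints:  # an empty segment is treated as a stop signal (assumption)
            break
        observation = execute(waypoints)  # fly the segment, receive a new observation
        trajectory.extend(waypoints)
    return trajectory
```

Keeping the context on the navigator rather than in the loop mirrors the abstract's point that each new observation extends the same autoregressive history instead of restarting prediction from scratch.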

Problem

Research questions and friction points this paper is trying to address.

aerial vision-language navigation

world action model

autoregressive prediction

closed-loop perception and action

3D environments

Innovation

Methods, ideas, or system contributions that make the work stand out.

autoregressive world model

vision-language navigation

action-aware reinforcement learning