The DAWN of World-Action Interactive Models

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing World Action Models (WAMs) often decouple scene evolution from action decision-making, neglecting their mutual dependence. This work proposes World-Action Interactive Models (WAIMs), instantiated in autonomous driving as DAWN, which explicitly models the recursive interaction between world predictions and actions. DAWN couples a world predictor with a world-conditioned action denoiser in a compact semantic latent space: the two hypotheses refine each other recursively at inference, and a short explicit latent rollout supports long-horizon trajectory generation. Experiments report strong planning performance and favorable safety-related results across multiple autonomous driving benchmarks, supporting the case for interactive world-action modeling.

📝 Abstract

A plausible scene evolution depends on the maneuver being considered, while a good maneuver depends on how the scene may evolve. Existing World Action Models (WAMs) largely miss this reciprocity, treating world prediction and action generation as either isolated parallel branches or rigid predict-then-plan pipelines. We formalize this perspective as World-Action Interactive Models (WAIMs), and instantiate it in autonomous driving with DAWN (Denoising Actions and World iNteractive model), a simple yet strong latent generative baseline. DAWN operates in a compact semantic latent space and couples a World Predictor with a World-Conditioned Action Denoiser: the predicted world hypothesis conditions action denoising, while the denoised action hypothesis is fed back to update the world prediction, so that both are recursively refined during inference. Rather than eliminating test-time world evolution altogether or rolling out the full future in pixel space, DAWN performs a short explicit latent rollout that is sufficient to support long-horizon trajectory generation in complex interactive scenes. Experiments show that DAWN achieves strong planning performance and favorable safety-related results across multiple autonomous driving benchmarks. More broadly, our results suggest that interactive world-action generation is a principled path toward truly actionable world models.
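
The recursive refinement described in the abstract can be illustrated with a toy loop. This is only a minimal sketch under strong assumptions, not DAWN's actual architecture: the function names (`predict_world`, `denoise_action`, `interactive_refine`), the linear update rules, the mixing weights, and the step size are all hypothetical stand-ins for learned networks. It shows only the alternation itself: the current action hypothesis updates the world hypothesis, and the world hypothesis in turn conditions the next denoising step.

```python
import random


def predict_world(latent, action, mix=0.1):
    """Hypothetical world predictor (stand-in for a learned network).

    Returns a world hypothesis from the observed latent and the current
    action hypothesis. The linear blend is an illustrative assumption.
    """
    return [(1.0 - mix) * z + mix * a for z, a in zip(latent, action)]


def denoise_action(action, world, step_size=0.5):
    """Hypothetical world-conditioned denoiser (stand-in for a learned network).

    Moves the noisy action a fraction of the way toward a target that, in
    this toy setup, is simply the world hypothesis itself.
    """
    return [a + step_size * (w - a) for a, w in zip(action, world)]


def interactive_refine(latent, num_steps=4, seed=0):
    """Alternate world prediction and action denoising for num_steps rounds."""
    rng = random.Random(seed)
    # Start from a pure-noise action hypothesis, as in denoising-style generation.
    action = [rng.gauss(0.0, 1.0) for _ in latent]
    world = list(latent)
    for _ in range(num_steps):
        # The action hypothesis is fed back to update the world prediction.
        world = predict_world(latent, action)
        # The world hypothesis conditions the next denoising step.
        action = denoise_action(action, world)
    return world, action


if __name__ == "__main__":
    world, action = interactive_refine([1.0, -1.0], num_steps=4)
    print("world hypothesis:", world)
    print("action hypothesis:", action)
```

In this toy setting the two hypotheses settle to a mutually consistent fixed point as the number of rounds grows, which is the property the interactive formulation is after; a real system would replace both functions with trained models operating on semantic latents.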

Problem

Research questions and friction points this paper is trying to address.

World-Action Interaction

Autonomous Driving

Scene Evolution

Action Generation

World Prediction

Innovation

Methods, ideas, or system contributions that make the work stand out.

World-Action Interaction

Latent Generative Model

Action Denoising