From Forecasting to Planning: Policy World Model for Collaborative State-Action Prediction

📅 2025-10-22
🤖 AI Summary
This paper addresses the decoupling of world modeling from trajectory planning in autonomous driving by proposing the Policy World Model (PWM), an end-to-end framework that unifies the two. PWM introduces action-free future state forecasting to enable human-like anticipatory perception, uses a context-guided tokenizer and a dynamically enhanced parallel token generation mechanism to make video forecasting more efficient, and trains with an adaptive dynamic focal loss. Using only a single front-facing camera, PWM matches or exceeds state-of-the-art multi-view, multi-modal methods on benchmarks such as nuScenes. The results indicate that unifying world modeling and planning is feasible and effective even with a lightweight, single-camera input.

📝 Abstract
Despite remarkable progress in driving world models, their potential for autonomous systems remains largely untapped: world models are mostly learned for world simulation and decoupled from trajectory planning. While recent efforts aim to unify world modeling and planning in a single framework, how world modeling can synergistically facilitate planning still requires further exploration. In this work, we introduce a new driving paradigm named Policy World Model (PWM), which not only integrates world modeling and trajectory planning within a unified architecture, but also benefits planning using the learned world knowledge through the proposed action-free future state forecasting scheme. Through collaborative state-action prediction, PWM can mimic human-like anticipatory perception, yielding more reliable planning performance. To improve the efficiency of video forecasting, we further introduce a dynamically enhanced parallel token generation mechanism, equipped with a context-guided tokenizer and an adaptive dynamic focal loss. Despite using only front camera input, our method matches or exceeds state-of-the-art approaches that rely on multi-view and multi-modal inputs. Code and model weights will be released at https://github.com/6550Zhao/Policy-World-Model.
Problem

Research questions and friction points this paper is trying to address.

How to unify world modeling and trajectory planning in autonomous driving
How to make planning more reliable through collaborative state-action prediction
How to make video forecasting more efficient with dynamic token generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified world modeling and trajectory planning architecture
Action-free future state forecasting for planning
Dynamically enhanced parallel token generation mechanism
Zhida Zhao
Dalian University of Technology
Talas Fu
Dalian University of Technology
Yifan Wang
Dalian University of Technology
Lijun Wang
Zhejiang University
Statistical Learning, Bioinformatics, Astrophysics
Huchuan Lu
Dalian University of Technology