OmniNWM: Omniscient Driving Navigation World Models

📅 2025-10-21
📈 Citations: 0
Influential citations: 0
📄 PDF
🤖 AI Summary
Existing autonomous driving world models are typically limited to a single state modality, short video sequences, coarse-grained action control, and no explicit reward modeling, which prevents joint modeling of state, action, and reward. This paper introduces OmniNWM, presented as the first unified world model for autonomous driving to address all three: it enables pixel-level trajectory control via a panoramic Plücker ray-map representation; constructs a regularized, dense, and differentiable reward function from generative 3D occupancy prediction; and jointly models RGB, semantics, metric depth, and 3D occupancy. The framework supports long-horizon autoregressive generation and closed-loop navigation evaluation. Experiments demonstrate state-of-the-art video fidelity, action-control accuracy, and long-horizon stability, while substantially improving the simulation of driving compliance and safety.
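The summary does not spell out how the ray-map is built. Below is a minimal sketch of the standard per-pixel Plücker parameterization such a representation rests on, assuming a pinhole camera with intrinsics `K` and a camera-to-world pose `(R, t)`; the function name and per-camera scope are illustrative, and how OmniNWM normalizes and assembles per-camera maps into a panoramic tensor is specified in the paper, not here.

```python
import numpy as np

def plucker_ray_map(K: np.ndarray, R: np.ndarray, t: np.ndarray,
                    H: int, W: int) -> np.ndarray:
    """Per-pixel Plücker ray map of shape (H, W, 6) for one pinhole camera.

    K: 3x3 intrinsics; R: 3x3 camera-to-world rotation;
    t: (3,) camera center in world coordinates.
    """
    # Pixel centers in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)            # (H, W, 3)

    # Back-project to world-space ray directions, normalized to unit length.
    dirs = pix @ np.linalg.inv(K).T @ R.T                        # (H, W, 3)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Plücker moment m = t x d; (d, m) identifies the 3D line through the pixel.
    moments = np.cross(np.broadcast_to(t, dirs.shape), dirs)     # (H, W, 3)
    return np.concatenate([dirs, moments], axis=-1)              # (H, W, 6)
```

Evaluating this for each camera pose along an input trajectory and stacking the results yields a pixel-aligned control signal of the kind the abstract describes.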

📝 Abstract
Autonomous driving world models are expected to work effectively across three core dimensions: state, action, and reward. Existing models, however, are typically restricted to limited state modalities, short video sequences, imprecise action control, and a lack of reward awareness. In this paper, we introduce OmniNWM, an omniscient panoramic navigation world model that addresses all three dimensions within a unified framework. For state, OmniNWM jointly generates panoramic videos of RGB, semantics, metric depth, and 3D occupancy. A flexible forcing strategy enables high-quality long-horizon auto-regressive generation. For action, we introduce a normalized panoramic Plücker ray-map representation that encodes input trajectories into pixel-level signals, enabling highly precise and generalizable control over panoramic video generation. Regarding reward, we move beyond learning reward functions with external image-based models: instead, we leverage the generated 3D occupancy to directly define rule-based dense rewards for driving compliance and safety. Extensive experiments demonstrate that OmniNWM achieves state-of-the-art performance in video generation, control accuracy, and long-horizon stability, while providing a reliable closed-loop evaluation framework through occupancy-grounded rewards. Project page is available at https://github.com/Arlo0o/OmniNWM.
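The abstract credits a flexible forcing strategy for long-horizon auto-regressive generation without detailing it here. As a rough illustration of the rollout pattern such a strategy enables, the sketch below generates frames one at a time, conditioning each step on a window of previously generated frames plus the pixel-aligned Plücker ray-map for the commanded trajectory; `model.denoise_next`, the window size, and the tensor shapes are assumptions for illustration, not the paper's API.

```python
import torch

def autoregressive_rollout(model, context: torch.Tensor,
                           ray_maps: torch.Tensor, horizon: int,
                           window: int = 8) -> torch.Tensor:
    """Roll a video world model forward one frame at a time.

    context:  (T0, C, H, W) observed frames that seed the rollout.
    ray_maps: (T0 + horizon, 6, H, W) per-frame Plücker ray maps
              encoding the commanded trajectory.
    """
    frames = list(context)
    for _ in range(horizon):
        # Condition on the most recent `window` frames plus the
        # pixel-aligned action signal for the frame being generated.
        ctx = torch.stack(frames[-window:])
        frames.append(model.denoise_next(ctx, ray_maps[len(frames)]))
    return torch.stack(frames)
```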
Problem

Research questions and friction points this paper is trying to address.

Generating panoramic videos with RGB, semantics, depth, and 3D occupancy
Enabling precise control over panoramic video generation using trajectory encoding
Defining rule-based dense rewards for driving compliance and safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates panoramic videos with multiple state modalities
Encodes trajectories into pixel-level signals for precise control
Defines dense rewards using generated 3D occupancy (see the sketch below)
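To make the occupancy-grounded reward concrete, here is a minimal sketch of a rule-based dense reward over a generated semantic occupancy grid, penalizing collision (ego footprint intersecting obstacles) and off-road driving (ground footprint leaving the drivable surface). The class labels, weights, and footprint-voxel interface are hypothetical; the paper defines its own compliance and safety rules.

```python
import numpy as np

# Hypothetical semantic labels in the generated occupancy grid.
FREE, DRIVABLE, OBSTACLE = 0, 1, 2

def occupancy_reward(occ: np.ndarray, ego_voxels: np.ndarray,
                     w_collision: float = 1.0, w_offroad: float = 0.5) -> float:
    """Dense rule-based reward from a semantic occupancy grid.

    occ:        (X, Y, Z) integer grid of semantic classes generated
                by the world model for the current step.
    ego_voxels: (N, 3) voxel indices covered by the ego vehicle.
    """
    labels = occ[ego_voxels[:, 0], ego_voxels[:, 1], ego_voxels[:, 2]]

    # Safety: fraction of the ego footprint intersecting obstacles.
    collision = np.mean(labels == OBSTACLE)

    # Compliance: fraction of the ground-level footprint that has
    # left the drivable surface.
    ground = ego_voxels[:, 2] == ego_voxels[:, 2].min()
    offroad = np.mean(labels[ground] != DRIVABLE)

    return float(-(w_collision * collision + w_offroad * offroad))
```

Because the reward is computed from the model's own generated occupancy rather than an external image-based critic, it can be evaluated densely at every rollout step, which supports the occupancy-grounded closed-loop evaluation the abstract describes.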
Authors

Bohan Li
Shanghai Jiao Tong University; Eastern Institute of Technology, Ningbo
Zhuang Ma
The Wharton School, University of Pennsylvania
Machine Learning, Statistics
Dalong Du
PhiGent
Baorui Peng
Eastern Institute of Technology, Ningbo
Zhujin Liang
Bigo Live
Computer Vision, Machine Learning, Deep Learning
Zhenqiang Liu
PhiGent
Chao Ma
Shanghai Jiao Tong University
Yueming Jin
Assistant Professor, National University of Singapore
Medical Image Analysis, Surgical AI & Robotics, Multimodal Learning
Hao Zhao
Tsinghua University
Wenjun Zeng
Eastern Institute of Technology, Ningbo
Xin Jin
Eastern Institute of Technology, Ningbo