EponaV2: Driving World Model with Comprehensive Future Reasoning

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two limitations in autonomous driving. The conventional perception-planning pipeline scales poorly because it depends on costly human annotations, and existing perception-free driving world models ground their planning reasoning solely in next-frame image prediction, which leaves them with a shallow understanding of the scene. The authors propose EponaV2, a driving world model that, inspired by how human drivers anticipate 3D geometry and semantics, forecasts richer future representations that can be decoded into future geometry and semantic maps. They also introduce a flow matching group relative policy optimization mechanism, inspired by the training recipe of large language models, to further improve planning accuracy. On three NAVSIM benchmarks, EponaV2 achieves state-of-the-art performance among perception-free models, with reported gains of +1.3 PDMS and +5.5 EPDMS.

📝 Abstract

Data scaling plays a pivotal role in the pursuit of general intelligence. However, the prevailing perception-planning paradigm in autonomous driving relies heavily on expensive manual annotations to supervise trajectory planning, which severely limits its scalability. Conversely, although existing perception-free driving world models achieve impressive driving performance, their real-world reasoning ability for planning is built solely on next-frame image forecasting. Lacking sufficient supervision, these models often struggle with comprehensive scene understanding, resulting in unsatisfactory trajectory planning. In this paper, we propose EponaV2, a novel paradigm of driving world models, which achieves high-quality planning with comprehensive future reasoning. Inspired by how human drivers anticipate 3D geometry and semantics, we train our model to forecast more comprehensive future representations, which can be additionally decoded to future geometry and semantic maps. Extracting the 3D and semantic modalities enables our model to deeply understand the surrounding environment, and the future prediction task significantly enhances the real-world reasoning capabilities of EponaV2, ultimately leading to improved trajectory planning. Moreover, inspired by the training recipe of Large Language Models (LLMs), we introduce a flow matching group relative policy optimization mechanism to further improve planning accuracy. The state-of-the-art (SOTA) performance of EponaV2 among perception-free models on three NAVSIM benchmarks (+1.3 PDMS, +5.5 EPDMS) demonstrates the effectiveness of our methods.
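
The abstract does not spell out how the flow matching group relative policy optimization mechanism is built. As a rough illustration of the "group relative" idea only, the sketch below shows the group-normalized advantage commonly used in GRPO for language models: several candidates are sampled for the same input, and each is scored relative to the mean and standard deviation of its own group. The function name, the epsilon value, and the example scores are hypothetical, and this is not the paper's implementation.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group.

    Generic GRPO-style advantage; an assumption for illustration,
    not taken from the EponaV2 paper.
    """
    if not rewards:
        return []
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std over the sampled group
    # eps keeps the division finite when every reward in the group is equal
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical scores for four trajectories sampled for the same scene.
scores = [0.82, 0.91, 0.40, 0.75]
advantages = group_relative_advantages(scores)
```

Candidates that score above the group mean receive a positive advantage and those below receive a negative one, so no separately learned value baseline is needed in this formulation.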

Problem

Research questions and friction points this paper is trying to address.

autonomous driving

world model

trajectory planning

future reasoning

scene understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

driving world model

future reasoning

3D geometry forecasting