🤖 AI Summary
Existing latent world models couple perception reconstruction with planning, which limits planning performance. This paper proposes WorldRFT, a planning-oriented latent world modeling framework with three components: (1) hierarchical planning decomposition, which steers representation learning toward the planning task; (2) a local-aware interactive iterative refinement mechanism that derives a planning-oriented driving policy; and (3) reinforcement fine-tuning with Group Relative Policy Optimization (GRPO), applied with trajectory Gaussianization and collision-aware rewards. The method integrates a vision-geometry foundation model with temporal self-supervised modeling in latent space. In nuScenes open-loop evaluation, the collision rate drops by 83% (from 0.30% to 0.05%). In NavSim closed-loop testing with camera-only input, the approach reaches 87.8 PDMS, close to the LiDAR-based SOTA method DiffusionDrive (88.1).
📝 Abstract
Latent world models enhance scene representation through temporal self-supervised learning, offering a paradigm for end-to-end autonomous driving that requires no perception annotations. However, reconstruction-oriented representation learning entangles perception with planning, leading to suboptimal optimization for planning. To address this challenge, we propose WorldRFT, a planning-oriented latent world model framework that aligns scene representation learning with planning through hierarchical planning decomposition and a local-aware interactive refinement mechanism, augmented by reinforcement learning fine-tuning (RFT) to enhance safety-critical policy performance. Specifically, WorldRFT integrates a vision-geometry foundation model to improve 3D spatial awareness, employs hierarchical planning task decomposition to guide representation optimization, and uses local-aware iterative refinement to derive a planning-oriented driving policy. Furthermore, we incorporate Group Relative Policy Optimization (GRPO), using trajectory Gaussianization and collision-aware rewards to fine-tune the driving policy, which yields systematic improvements in safety. WorldRFT achieves state-of-the-art (SOTA) performance on both the open-loop nuScenes and closed-loop NavSim benchmarks. On nuScenes, it reduces the collision rate by 83% (from 0.30% to 0.05%). On NavSim, using camera-only sensor input, it attains performance competitive with the LiDAR-based SOTA method DiffusionDrive (87.8 vs. 88.1 PDMS).
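The GRPO fine-tuning step can be illustrated with a minimal sketch. The abstract does not specify the reward, the policy parameterization, or the loss, so everything here is an assumption: "trajectory Gaussianization" is read as treating the planned waypoints as the mean of a diagonal Gaussian with a fixed standard deviation, the collision-aware reward is a hypothetical distance check against point obstacles, and the clipped probability ratio and KL penalty of the full GRPO objective are omitted. All function names are illustrative and not taken from WorldRFT.

```python
import math
import random


def gaussian_log_prob(traj, mean, std):
    """Log-density of a trajectory under an independent Gaussian per waypoint coordinate."""
    logp = 0.0
    for (x, y), (mx, my) in zip(traj, mean):
        for v, m in ((x, mx), (y, my)):
            logp += -0.5 * ((v - m) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)
    return logp


def collision_aware_reward(traj, obstacles, safe_dist=1.0):
    """Hypothetical reward: -1 if any waypoint is within safe_dist of an obstacle, else 0."""
    for x, y in traj:
        for ox, oy in obstacles:
            if math.hypot(x - ox, y - oy) < safe_dist:
                return -1.0
    return 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward against its own sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def sample_group(mean, std, group_size, rng):
    """Draw a group of candidate trajectories around the planned (mean) trajectory."""
    return [[(rng.gauss(mx, std), rng.gauss(my, std)) for mx, my in mean]
            for _ in range(group_size)]


# Toy scene: drive straight along y, with one obstacle slightly off the path.
rng = random.Random(0)
mean_traj = [(0.0, float(t)) for t in range(1, 5)]
obstacles = [(0.5, 3.0)]

group = sample_group(mean_traj, 0.5, 8, rng)
rewards = [collision_aware_reward(t, obstacles) for t in group]
advantages = group_relative_advantages(rewards)

# Simplified surrogate objective (no clipping, no KL term): maximizing it shifts
# probability mass toward samples that scored better than their group average.
surrogate = sum(a * gaussian_log_prob(t, mean_traj, 0.5)
                for a, t in zip(advantages, group)) / len(group)
```

Because advantages are computed relative to the group mean, no learned value function is needed. A group in which every sample receives the same reward produces zero advantages and therefore no update.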