🤖 AI Summary
Existing world models for end-to-end autonomous driving model static regions redundantly and offer too little interaction between trajectories and scene dynamics, which limits planning performance. To address these issues, this work proposes the Temporal Residual World Model (TR-World), which extracts dynamic object information directly from temporal residuals, without relying on explicit detection or tracking, and combines its prediction with the static information in the current bird’s-eye-view (BEV) features to obtain accurate future BEV representations. A Future-Guided Trajectory Refinement (FGTR) module lets prior trajectories interact with the future BEV features: it refines the trajectories using the future scene context and, in turn, provides sparse spatiotemporal supervision on the future BEV features to prevent world model collapse. Evaluated on nuScenes and NAVSIM, the proposed approach, named ResWorld, achieves state-of-the-art planning performance.
📝 Abstract
World models' comprehensive understanding of driving scenarios has significantly improved the planning accuracy of end-to-end autonomous driving frameworks. However, redundant modeling of static regions and a lack of deep interaction with trajectories keep world models from reaching their full effectiveness. In this paper, we propose the Temporal Residual World Model (TR-World), which focuses on dynamic object modeling. By calculating the temporal residuals of scene representations, TR-World extracts information about dynamic objects without relying on detection or tracking. It takes only temporal residuals as input and can therefore predict the future spatial distribution of dynamic objects more precisely. Combining this prediction with the static object information contained in the current bird's-eye-view (BEV) features yields accurate future BEV features. Furthermore, we propose the Future-Guided Trajectory Refinement (FGTR) module, which performs interaction between prior trajectories (predicted from the current scene representation) and the future BEV features. This module not only uses future road conditions to refine trajectories but also provides sparse spatial-temporal supervision on the future BEV features to prevent world model collapse. Comprehensive experiments on the nuScenes and NAVSIM datasets demonstrate that our method, named ResWorld, achieves state-of-the-art planning performance. The code is available at https://github.com/mengtan00/ResWorld.git.
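To make the temporal-residual idea concrete, the following toy sketch shows how differencing two consecutive BEV maps cancels static content and isolates what moved. It is only an illustration under stated assumptions, not the ResWorld implementation: the maps are single-channel grids that are assumed to be already ego-motion aligned, `toy_world_model` is a hypothetical stand-in (a one-cell shift) for the learned world model, and adding the predicted residual to the current map is an assumed way of combining the two, since the abstract does not specify the operation.

```python
from typing import List

Grid = List[List[float]]  # single-channel BEV map, indexed [row][col]


def temporal_residual(bev_prev: Grid, bev_curr: Grid) -> Grid:
    """Cell-wise difference of two ego-aligned BEV maps; static cells cancel to 0."""
    return [[c - p for p, c in zip(row_p, row_c)]
            for row_p, row_c in zip(bev_prev, bev_curr)]


def toy_world_model(residual: Grid) -> Grid:
    """Hypothetical stand-in for the learned model: shift the residual one cell right."""
    return [[0.0] + row[:-1] for row in residual]


def compose_future_bev(bev_curr: Grid, predicted_residual: Grid) -> Grid:
    """Add the predicted dynamic change onto the current map (assumed combination)."""
    return [[c + r for c, r in zip(row_c, row_r)]
            for row_c, row_r in zip(bev_curr, predicted_residual)]


bev_prev = [
    [0.0, 1.0, 0.0, 0.0],  # moving object at column 1
    [0.5, 0.5, 0.5, 0.5],  # static structure
]
bev_curr = [
    [0.0, 0.0, 1.0, 0.0],  # the object has moved to column 2
    [0.5, 0.5, 0.5, 0.5],  # static structure is unchanged
]

# The static row vanishes from the residual; only the motion remains.
residual = temporal_residual(bev_prev, bev_curr)

# Predict the next change from the residual alone, then restore static content
# from the current map.
future_bev = compose_future_bev(bev_curr, toy_world_model(residual))
```

In this toy case the residual is zero everywhere on the static row, so the predictor only ever sees the moving object, while the static structure reaches the future map through the current BEV features.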