🤖 AI Summary
Vision-centric world models for end-to-end autonomous driving spend capacity on redundantly modeling static backgrounds and predict dynamic scene evolution poorly. Method: We propose the Implicit Residual World Model (IR-WM), which combines a bird's-eye-view (BEV) representation, implicit residual prediction, a temporal prior, and semantic and dynamic alignment to jointly perform 4D occupancy forecasting and trajectory planning from camera inputs. Contributions/Results: IR-WM introduces (i) residual BEV temporal modeling, which predicts only the changes in scene state rather than reconstructing the full scene; (ii) an alignment module that calibrates semantic and dynamic misalignments to limit error accumulation over time; and (iii) a study of coupling schemes between the world model and the planner. Evaluated on nuScenes, IR-WM achieves state-of-the-art performance in both 4D occupancy forecasting and trajectory planning, with improved long-horizon prediction stability and planning accuracy.
📝 Abstract
End-to-end autonomous driving systems increasingly rely on vision-centric world models to understand and predict their environment. However, a common inefficiency in these models is that they fully reconstruct future scenes, which expends significant capacity on redundantly modeling static backgrounds. To address this, we propose IR-WM, an Implicit Residual World Model that focuses on modeling the current state and the evolution of the world. IR-WM first establishes a robust bird's-eye-view (BEV) representation of the current state from the visual observation. It then uses the BEV features from the previous timestep as a strong temporal prior and predicts only the "residual", i.e., the changes conditioned on the ego-vehicle's actions and the scene context. To alleviate error accumulation over time, we further apply an alignment module that calibrates semantic and dynamic misalignments. Moreover, we investigate different forecasting-planning coupling schemes and demonstrate that the implicit future state generated by the world model substantially improves planning accuracy. On the nuScenes benchmark, IR-WM achieves top performance in both 4D occupancy forecasting and trajectory planning.
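
The residual idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `predict_residual` and `rollout`, the single linear layer with `tanh`, and the tensor shapes are all assumptions made for illustration, and the alignment module and the planner are omitted. The sketch shows only the update rule in which each future BEV state is the previous state plus a predicted change conditioned on the ego action.

```python
import numpy as np

def predict_residual(bev_prev, ego_action, weights):
    """Hypothetical residual head (an assumption, not the paper's network).

    bev_prev:   (H, W, C) BEV features from the previous timestep
    ego_action: (A,) ego-action embedding
    weights:    (C + A, C) parameters of a single linear layer
    Returns a (H, W, C) array holding the predicted change in BEV features.
    """
    h, w, _ = bev_prev.shape
    # Tile the ego action over the BEV grid so every cell is conditioned on it.
    action_map = np.broadcast_to(ego_action, (h, w, ego_action.shape[0]))
    x = np.concatenate([bev_prev, action_map], axis=-1)  # (H, W, C + A)
    return np.tanh(x @ weights)                          # (H, W, C)

def rollout(bev_current, ego_actions, weights):
    """Roll the BEV state forward, one residual update per planned action."""
    states = []
    bev = bev_current
    for action in ego_actions:
        # The previous BEV state is the temporal prior; only the change is predicted.
        bev = bev + predict_residual(bev, action, weights)
        states.append(bev)
    return states

# Usage with random placeholder features (not real nuScenes data).
rng = np.random.default_rng(0)
bev0 = rng.normal(size=(4, 4, 8))                       # H=4, W=4, C=8
actions = [np.array([1.0, 0.0]), np.array([0.5, 0.1])]  # A=2
weights = rng.normal(size=(10, 8)) * 0.1
future_states = rollout(bev0, actions, weights)
```

When the residual head outputs zeros, the rollout simply carries the previous BEV state forward. This is the sense in which static background does not need to be re-predicted at every step.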