🤖 AI Summary
This work addresses the performance limitations of existing world model–based autonomous driving planners under constrained data and computational resources, which stem from insufficient representation compression, limited spatial understanding, and inadequate modeling of temporal dynamics. To overcome these challenges, we propose Latent-WAM, an end-to-end framework that integrates a spatial-aware learnable query compression mechanism with a causal Transformer–driven Dynamic Latent World Model (DLWM). The Spatial-Aware Compressive World Encoder (SCWE), enhanced by geometric knowledge distillation, enriches spatial semantics, while autoregressive future state prediction improves temporal consistency. Evaluated on NAVSIM v2 and HUGSIM, our method achieves 89.3 EPDMS and 28.9 HD-Score, respectively, surpassing the previous best perception-free approach by 3.2 EPDMS with only 104M parameters and less training data.
📝 Abstract
We introduce Latent-WAM, an efficient end-to-end autonomous driving framework that achieves strong trajectory planning through spatially aware and dynamics-informed latent world representations. Existing world-model-based planners suffer from inadequately compressed representations, limited spatial understanding, and underutilized temporal dynamics, resulting in suboptimal planning under constrained data and compute budgets. Latent-WAM addresses these limitations with two core modules: a Spatial-Aware Compressive World Encoder (SCWE) that distills geometric knowledge from a foundation model and compresses multi-view images into compact scene tokens via learnable queries, and a Dynamic Latent World Model (DLWM) that employs a causal Transformer to autoregressively predict future world states conditioned on historical visual and motion representations. Extensive experiments on NAVSIM v2 and HUGSIM demonstrate new state-of-the-art results: 89.3 EPDMS on NAVSIM v2 and 28.9 HD-Score on HUGSIM, surpassing the best prior perception-free method by 3.2 EPDMS with significantly less training data and a compact 104M-parameter model.
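The two mechanisms the abstract names, learnable-query compression of multi-view features and causal (past-only) attention for autoregressive latent prediction, can be sketched as a minimal NumPy toy. The shapes, single-head attention, and function names below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compress_with_queries(image_tokens, queries):
    """Single-head cross-attention: K learnable queries pool N image
    tokens into K compact scene tokens (the compression idea in SCWE)."""
    d = queries.shape[-1]
    attn = softmax(queries @ image_tokens.T / np.sqrt(d))  # (K, N)
    return attn @ image_tokens                             # (K, d)

def causal_step(latent_seq):
    """Causal self-attention: each timestep attends only to itself and
    the past, as in an autoregressive latent world model."""
    T, d = latent_seq.shape
    scores = latent_seq @ latent_seq.T / np.sqrt(d)        # (T, T)
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf  # mask future
    return softmax(scores) @ latent_seq                    # (T, d)

# Toy multi-view input: 3 views x 64 patches of 32-dim features,
# compressed into 8 scene tokens.
image_tokens = rng.standard_normal((3 * 64, 32))
queries = rng.standard_normal((8, 32))
scene_tokens = compress_with_queries(image_tokens, queries)  # (8, 32)

# Roll a 4-step latent history through one causal-attention step.
history = rng.standard_normal((4, 32))
pred = causal_step(history)                                  # (4, 32)
```

The causal mask is what makes the prediction autoregressive: perturbing the latest latent state changes only the final output row, never the earlier ones.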