🤖 AI Summary
Autonomous driving systems predict poorly in rare yet safety-critical edge cases, primarily because of training data bias and limited model generalization. To address this, we propose a novel framework integrating a world model, Mixture-of-Experts (MoE), and a large language model (LLM). Our approach introduces a first-of-its-kind *scenario routing mechanism* to decompose complex edge cases, and a lightweight temporal tokenizer enabling zero-shot spatiotemporal context fusion and causal counterfactual reasoning. Notably, this is the first work to leverage LLMs to enhance the long-horizon reasoning capability of world models. To standardize evaluation, we release *nuScenes-corner*, a new benchmark dedicated to edge-case prediction. Experiments demonstrate state-of-the-art performance across four diverse datasets (nuScenes, NGSIM, HighD, and MoCAD), with significant robustness improvements under both edge-case conditions and data scarcity.
📝 Abstract
Accurate and reliable motion forecasting is essential for the safe deployment of autonomous vehicles (AVs), particularly in rare but safety-critical scenarios known as corner cases. Existing models often underperform in these situations because common scenes are over-represented in training data and the models generalize poorly. To address this limitation, we present WM-MoE, the first world model-based motion forecasting framework that unifies perception, temporal memory, and decision making for high-risk corner-case scenarios. The model constructs a compact scene representation that explains current observations, anticipates future dynamics, and evaluates the outcomes of potential actions. To enhance long-horizon reasoning, we leverage large language models (LLMs) and introduce a lightweight temporal tokenizer that maps agent trajectories and contextual cues into the LLM's feature space without additional training, enriching temporal context and commonsense priors. Furthermore, a mixture-of-experts (MoE) module decomposes complex corner cases into subproblems and allocates capacity across scenario types: a router assigns scenes to specialized experts that infer agent intent and perform counterfactual rollouts. In addition, we introduce nuScenes-corner, a new benchmark that comprises four real-world corner-case scenarios for rigorous evaluation. Extensive experiments on four benchmark datasets (nuScenes, NGSIM, HighD, and MoCAD) show that WM-MoE consistently outperforms state-of-the-art (SOTA) baselines and remains robust under corner-case and missing-data conditions, indicating the promise of world model-based architectures for robust and generalizable motion forecasting in fully autonomous vehicles.
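
The routing idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the gating scheme, so the example below assumes generic top-k softmax gating, a common MoE design; it is not WM-MoE's actual router. The names `route_top_k`, `moe_forecast`, the toy experts, and the router weights are all hypothetical and chosen only for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(scene_embedding, router_weights, k=2):
    """Score every expert with a linear router and keep the k best.

    Returns [(expert_index, gate_weight)], with gate weights renormalized
    to sum to 1 over the selected experts. (Assumed design, not the paper's.)
    """
    logits = [sum(w * x for w, x in zip(row, scene_embedding))
              for row in router_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forecast(scene_embedding, router_weights, experts, k=2):
    """Blend the selected experts' outputs using their gate weights."""
    selected = route_top_k(scene_embedding, router_weights, k)
    outputs = [(w, experts[i](scene_embedding)) for i, w in selected]
    dim = len(outputs[0][1])
    return [sum(w * out[d] for w, out in outputs) for d in range(dim)]

# Hypothetical toy experts: each maps a scene embedding to a 2-D displacement.
experts = [
    lambda e: [1.0, 0.0],
    lambda e: [0.0, 1.0],
    lambda e: [-1.0, 0.0],
]
# Hypothetical router weights: one row of scores per expert.
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
scene = [2.0, 0.5]

prediction = moe_forecast(scene, router_weights, experts, k=2)
print(prediction)
```

In this toy setup the router scores the three experts at 2.0, 0.5, and -2.5, so only the first two contribute to the blended output. In a real system the experts would be learned networks specialized per scenario type, and the router would be trained jointly with them.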