AI Summary
Diffusion models for autonomous driving planning suffer from high latency due to iterative sampling, and modeling trajectories directly in raw coordinate space yields weak high-level semantic representations, often collapsing into low-level kinematic patterns. To address this, we propose a latent-space single-step denoising planning framework. First, a disentangled VAE constructs a low-dimensional planning latent space that explicitly separates semantic intent from motion dynamics. Second, a single-step diffusion denoiser operates within this latent space, eliminating the need for iterative sampling. Third, a fine-grained scene feature distillation mechanism explicitly aligns high-level planning decisions with contextual semantic cues. Evaluated on the closed-loop nuPlan benchmark, our method achieves state-of-the-art performance among learning-based planners, accelerating inference by up to 10× over prior diffusion-based approaches while preserving multimodality, planning efficiency, and semantic consistency.
Abstract
Diffusion models have demonstrated strong capabilities for modeling human-like driving behaviors in autonomous driving, but their iterative sampling process induces substantial latency, and operating directly on raw trajectory points forces the model to spend capacity on low-level kinematics rather than high-level multi-modal semantics. To address these limitations, we propose LAtent Planner (LAP), a framework that plans in a VAE-learned latent space disentangling high-level intents from low-level kinematics, enabling the planner to capture rich, multi-modal driving strategies. We further introduce a fine-grained feature distillation mechanism that guides interaction and fusion between the high-level semantic planning space and the vectorized scene context. Notably, LAP produces high-quality plans in a single denoising step, substantially reducing computational overhead. In extensive evaluations on the large-scale nuPlan benchmark, LAP achieves state-of-the-art closed-loop performance among learning-based planning methods while delivering up to a 10× inference speed-up over previous SOTA approaches.
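To make the pipeline concrete, here is a minimal sketch of the plan-in-latent-space idea: a VAE compresses trajectories into a low-dimensional latent, and a single-step denoiser maps a pure-noise latent plus scene features directly to a clean latent, which the decoder turns back into a trajectory. All names, dimensions, and the linear stand-in networks are illustrative assumptions, not the paper's actual architecture; the point is the control flow (one denoising pass per candidate plan, no iterative sampling loop).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (assumptions for illustration): e.g. 40 future
# waypoints x (x, y) -> an 80-d raw trajectory, compressed to an 8-d
# planning latent, conditioned on a 16-d vectorized scene feature.
TRAJ_DIM, LATENT_DIM, SCENE_DIM = 80, 8, 16

# Linear stand-ins for the learned VAE encoder/decoder and the denoiser.
W_enc = rng.normal(size=(LATENT_DIM, TRAJ_DIM)) * 0.1
W_dec = np.linalg.pinv(W_enc)  # decoder approximately inverts the encoder
W_den = rng.normal(size=(LATENT_DIM, LATENT_DIM + SCENE_DIM)) * 0.1

def encode(traj: np.ndarray) -> np.ndarray:
    """Compress a raw trajectory into the planning latent."""
    return W_enc @ traj

def decode(z: np.ndarray) -> np.ndarray:
    """Map a planning latent back to a raw trajectory."""
    return W_dec @ z

def denoise_one_step(z_noisy: np.ndarray, scene: np.ndarray) -> np.ndarray:
    # Single forward pass: predict the clean latent directly from the
    # (noisy latent, scene features) pair -- no iterative sampling.
    return W_den @ np.concatenate([z_noisy, scene])

def plan(scene: np.ndarray, n_modes: int = 3) -> np.ndarray:
    # Different noise seeds yield multi-modal candidate plans, each
    # produced by exactly one denoising step.
    plans = []
    for _ in range(n_modes):
        z0 = rng.normal(size=LATENT_DIM)       # pure-noise latent
        z_clean = denoise_one_step(z0, scene)  # one-step denoising
        plans.append(decode(z_clean))          # back to trajectory space
    return np.stack(plans)

scene = rng.normal(size=SCENE_DIM)
candidates = plan(scene)
print(candidates.shape)  # (3, 80): three candidate trajectories
```

The speed advantage in the abstract comes from exactly this structure: a diffusion planner normally loops `denoise_one_step` many times per plan, whereas here each candidate costs one forward pass in a space far smaller than the raw trajectory space.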