Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving

📅 2025-11-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address inaccurate spatial perception, poor robustness in long-tail scenarios, and weak spatial reasoning in vision-language-action (VLA) systems for autonomous driving, this paper proposes an end-to-end VLA model that implicitly unifies 2D and 3D scene understanding within a single vision-language model (VLM). It introduces World-PV and World-BEV spatial tokens that jointly encode coordinates and confidence, and a grid-conditioned prediction mechanism with parallel autoregressive decoding to stabilize localization of distant objects, small targets, and complex interactions. Leveraging IoU-aware scoring, dense object perception, joint BEV/PV modeling, and co-optimization with a trajectory decoder, the model achieves 51.7 mAP on COCO 2D detection and 58.9 mAP on nuScenes BEV 3D detection, surpassing classical detectors. On planning benchmarks it outperforms DiffusionDrive on nuScenes and NAVSIM, improving PMDS by 2.1 on NAVSIM. The approach also shows strong open-vocabulary comprehension and long-tail generalization.

📝 Abstract
Autonomous driving relies heavily on accurate and robust spatial perception. Many failures arise from perception inaccuracies and instability, especially in long-tail scenarios and complex interactions. However, current vision-language models are weak at spatial grounding and understanding, and VLA systems built on them therefore show limited perception and localization ability. To address these challenges, we introduce Percept-WAM, a perception-enhanced World-Awareness-Action Model that is the first to implicitly integrate 2D/3D scene understanding abilities within a single vision-language model (VLM). Instead of relying on QA-style spatial reasoning, Percept-WAM unifies 2D/3D perception tasks into World-PV and World-BEV tokens that encode both spatial coordinates and confidence. We propose a grid-conditioned prediction mechanism for dense object perception, incorporating IoU-aware scoring and parallel autoregressive decoding, which improves stability in long-tail, far-range, and small-object scenarios. Additionally, Percept-WAM leverages pretrained VLM parameters to retain general intelligence (e.g., logical reasoning) and can directly output both perception results and trajectory control signals. Experiments show that Percept-WAM matches or surpasses classical detectors and segmenters on downstream perception benchmarks, achieving 51.7 mAP on COCO 2D detection and 58.9 mAP on nuScenes BEV 3D detection. When integrated with trajectory decoders, it further improves planning performance on nuScenes and NAVSIM, e.g., surpassing DiffusionDrive by 2.1 PMDS on NAVSIM. Qualitative results further highlight its strong open-vocabulary and long-tail generalization.
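As a loose illustration of the abstract's idea of spatial tokens that "encode both spatial coordinates and confidence", the sketch below packs a normalized (x, y) position and a confidence value into a single discrete token id. The bin counts, function names, and packing scheme are invented for illustration only; the paper's actual World-PV/World-BEV tokenization is not described on this page.

```python
# Hypothetical sketch: one discrete "spatial token" carrying both grid
# coordinates and a confidence bucket. All constants and the packing
# scheme are illustrative assumptions, not the paper's tokenizer.

NUM_BINS = 100        # coordinate bins per axis (assumed)
NUM_CONF_BINS = 10    # confidence buckets (assumed)

def encode_spatial_token(x: float, y: float, conf: float) -> int:
    """Pack normalized (x, y) in [0, 1] and confidence into one token id."""
    xb = min(int(x * NUM_BINS), NUM_BINS - 1)
    yb = min(int(y * NUM_BINS), NUM_BINS - 1)
    cb = min(int(conf * NUM_CONF_BINS), NUM_CONF_BINS - 1)
    return (xb * NUM_BINS + yb) * NUM_CONF_BINS + cb

def decode_spatial_token(token: int):
    """Invert the packing, returning bin-center coordinates and confidence."""
    cb = token % NUM_CONF_BINS
    xy = token // NUM_CONF_BINS
    xb, yb = divmod(xy, NUM_BINS)
    return ((xb + 0.5) / NUM_BINS,
            (yb + 0.5) / NUM_BINS,
            (cb + 0.5) / NUM_CONF_BINS)

tok = encode_spatial_token(0.42, 0.77, 0.9)
x, y, c = decode_spatial_token(tok)
```

The point of such a packing is that a language model can emit locations and their reliability as ordinary vocabulary tokens, rather than as free-form QA text.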
Problem

Research questions and friction points this paper is trying to address.

Enhancing spatial perception accuracy for autonomous driving systems
Improving robustness in long-tail scenarios and complex interactions
Overcoming limitations of current vision-language models in spatial grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Implicitly integrates 2D/3D scene understanding in VLM
Uses grid-conditioned prediction with IoU-aware scoring
Leverages pretrained VLM parameters for general intelligence
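The grid-conditioned, IoU-aware idea in the bullets above can be caricatured as dense per-cell prediction with rescoring: every grid cell proposes a box with a class confidence and a predicted IoU, and the final score multiplies the two so confident but poorly localized boxes are suppressed. Everything below (function names, threshold, rescoring rule) is an illustrative assumption, not the paper's implementation.

```python
# Hypothetical sketch of IoU-aware scoring over dense per-grid-cell
# predictions. Each cell yields (class confidence, predicted IoU, box)
# in parallel; the rescoring rule and threshold are assumptions.

def iou_aware_detections(cell_preds, threshold=0.3):
    """cell_preds: iterable of (cls_conf, pred_iou, box) per grid cell.
    Rescore each cell by cls_conf * pred_iou so badly localized boxes
    are down-weighted, then keep cells above the threshold."""
    kept = []
    for cls_conf, pred_iou, box in cell_preds:
        score = cls_conf * pred_iou   # IoU-aware score
        if score >= threshold:
            kept.append((score, box))
    return sorted(kept, reverse=True)

preds = [
    (0.9, 0.8, "boxA"),   # confident and well-localized -> kept
    (0.9, 0.2, "boxB"),   # confident but badly localized -> filtered
    (0.4, 0.9, "boxC"),   # moderate confidence, good box -> kept
]
dets = iou_aware_detections(preds)
```

Because every cell is scored independently, all cells can be decoded in one parallel pass, which is the intuition behind replacing strictly sequential autoregression for dense perception.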
👥 Authors
Jianhua Han (2030 Research, YinWang, Huawei)
Meng Tian (Yinwang Intelligent Technology Co. Ltd.)
Jiangtong Zhu (XJTU)
Fan He (Yinwang Intelligent Technology Co. Ltd.)
Huixin Zhang (Shanghai University)
Sitong Guo (Yinwang Intelligent Technology Co. Ltd.)
Dechang Zhu (Yinwang Intelligent Technology Co. Ltd.)
Hao Tang (Yinwang Intelligent Technology Co. Ltd.)
Pei Xu (Yinwang Intelligent Technology Co. Ltd.)
Yuze Guo (Yinwang Intelligent Technology Co. Ltd.)
Minzhe Niu (Yinwang Intelligent Technology Co. Ltd.)
Haojie Zhu (University of Michigan)
Qichao Dong (Yinwang Intelligent Technology Co. Ltd.)
Xuechao Yan (Yinwang Intelligent Technology Co. Ltd.)
Siyuan Dong (Postdoc @ University of Washington; PhD @ MIT)
Lu Hou (Yinwang Intelligent Technology Co. Ltd.)
Qingqiu Huang (Yinwang Intelligent Technology Co. Ltd.)
Xiaosong Jia (Assistant Professor, Institute of Trustworthy Embodied AI (TEAI), Fudan University)
Hang Xu (Yinwang Intelligent Technology Co. Ltd.)