Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving

📅 2025-11-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address inaccurate spatial perception, poor robustness in long-tail scenarios, and weak spatial reasoning in vision-language-action (VLA) systems for autonomous driving, this paper proposes an end-to-end VLA model that implicitly unifies 2D and 3D scene understanding within a single vision-language model (VLM). It introduces World-PV and World-BEV spatial tokens that jointly encode coordinates and confidence, and a grid-conditioned prediction mechanism with parallel autoregressive decoding to stabilize localization of distant objects, small targets, and complex interactions. Leveraging IoU-aware scoring, dense object perception, joint BEV/PV modeling, and co-optimization with a trajectory decoder, the model achieves 51.7 mAP on COCO 2D detection and 58.9 mAP on nuScenes BEV 3D detection, surpassing classical detectors. On planning benchmarks it outperforms DiffusionDrive on nuScenes and NAVSIM, improving PMDS by 2.1 on NAVSIM. The approach also shows strong open-vocabulary comprehension and long-tail generalization.

📝 Abstract
Autonomous driving relies heavily on accurate and robust spatial perception. Many failures arise from perception inaccuracies and instability, especially in long-tail scenarios and complex interactions. However, current vision-language models are weak at spatial grounding and understanding, and VLA systems built on them therefore show limited perception and localization ability. To address these challenges, we introduce Percept-WAM, a perception-enhanced World-Awareness-Action Model that is the first to implicitly integrate 2D/3D scene understanding abilities within a single vision-language model (VLM). Instead of relying on QA-style spatial reasoning, Percept-WAM unifies 2D/3D perception tasks into World-PV and World-BEV tokens that encode both spatial coordinates and confidence. We propose a grid-conditioned prediction mechanism for dense object perception, incorporating IoU-aware scoring and parallel autoregressive decoding, which improves stability in long-tail, far-range, and small-object scenarios. Additionally, Percept-WAM leverages pretrained VLM parameters to retain general intelligence (e.g., logical reasoning) and can directly output both perception results and trajectory control signals. Experiments show that Percept-WAM matches or surpasses classical detectors and segmenters on downstream perception benchmarks, achieving 51.7 mAP on COCO 2D detection and 58.9 mAP on nuScenes BEV 3D detection. When integrated with trajectory decoders, it further improves planning performance on nuScenes and NAVSIM, e.g., surpassing DiffusionDrive by 2.1 PMDS on NAVSIM. Qualitative results further highlight its strong open-vocabulary and long-tail generalization.
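As a loose illustration of the abstract's idea of spatial tokens that "encode both spatial coordinates and confidence", the sketch below packs a normalized (x, y) position and a confidence value into a single discrete token id. The bin counts, function names, and packing scheme are invented for illustration only; the paper's actual World-PV/World-BEV tokenization is not described on this page.

```python
# Hypothetical sketch: one discrete "spatial token" carrying both grid
# coordinates and a confidence bucket. All constants and the packing
# scheme are illustrative assumptions, not the paper's tokenizer.

NUM_BINS = 100        # coordinate bins per axis (assumed)
NUM_CONF_BINS = 10    # confidence buckets (assumed)

def encode_spatial_token(x: float, y: float, conf: float) -> int:
    """Pack normalized (x, y) in [0, 1] and confidence into one token id."""
    xb = min(int(x * NUM_BINS), NUM_BINS - 1)
    yb = min(int(y * NUM_BINS), NUM_BINS - 1)
    cb = min(int(conf * NUM_CONF_BINS), NUM_CONF_BINS - 1)
    return (xb * NUM_BINS + yb) * NUM_CONF_BINS + cb

def decode_spatial_token(token: int):
    """Invert the packing, returning bin-center coordinates and confidence."""
    cb = token % NUM_CONF_BINS
    xy = token // NUM_CONF_BINS
    xb, yb = divmod(xy, NUM_BINS)
    return ((xb + 0.5) / NUM_BINS,
            (yb + 0.5) / NUM_BINS,
            (cb + 0.5) / NUM_CONF_BINS)

tok = encode_spatial_token(0.42, 0.77, 0.9)
x, y, c = decode_spatial_token(tok)
```

The point of such a packing is that a language model can emit locations and their reliability as ordinary vocabulary tokens, rather than as free-form QA text.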
Problem

Research questions and friction points this paper is trying to address.

Enhancing spatial perception accuracy for autonomous driving systems
Improving robustness in long-tail scenarios and complex interactions
Overcoming limitations of current vision-language models in spatial grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Implicitly integrates 2D/3D scene understanding in VLM
Uses grid-conditioned prediction with IoU-aware scoring
Leverages pretrained VLM parameters for general intelligence
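The grid-conditioned, IoU-aware idea in the bullets above can be caricatured as dense per-cell prediction with rescoring: every grid cell proposes a box with a class confidence and a predicted IoU, and the final score multiplies the two so confident but poorly localized boxes are suppressed. Everything below (function names, threshold, rescoring rule) is an illustrative assumption, not the paper's implementation.

```python
# Hypothetical sketch of IoU-aware scoring over dense per-grid-cell
# predictions. Each cell yields (class confidence, predicted IoU, box)
# in parallel; the rescoring rule and threshold are assumptions.

def iou_aware_detections(cell_preds, threshold=0.3):
    """cell_preds: iterable of (cls_conf, pred_iou, box) per grid cell.
    Rescore each cell by cls_conf * pred_iou so badly localized boxes
    are down-weighted, then keep cells above the threshold."""
    kept = []
    for cls_conf, pred_iou, box in cell_preds:
        score = cls_conf * pred_iou   # IoU-aware score
        if score >= threshold:
            kept.append((score, box))
    return sorted(kept, reverse=True)

preds = [
    (0.9, 0.8, "boxA"),   # confident and well-localized -> kept
    (0.9, 0.2, "boxB"),   # confident but badly localized -> filtered
    (0.4, 0.9, "boxC"),   # moderate confidence, good box -> kept
]
dets = iou_aware_detections(preds)
```

Because every cell is scored independently, all cells can be decoded in one parallel pass, which is the intuition behind replacing strictly sequential autoregression for dense perception.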
👥 Authors
Jianhua Han (2030 Research, YinWang, Huawei)
Meng Tian (Yinwang Intelligent Technology Co. Ltd.)
Jiangtong Zhu (XJTU)
Fan He (Yinwang Intelligent Technology Co. Ltd.)
Huixin Zhang (Shanghai University)
Sitong Guo (Yinwang Intelligent Technology Co. Ltd.)
Dechang Zhu (Yinwang Intelligent Technology Co. Ltd.)
Hao Tang (Yinwang Intelligent Technology Co. Ltd.)
Pei Xu (Yinwang Intelligent Technology Co. Ltd.)
Yuze Guo (Yinwang Intelligent Technology Co. Ltd.)
Minzhe Niu (Yinwang Intelligent Technology Co. Ltd.)
Haojie Zhu (University of Michigan)
Qichao Dong (Yinwang Intelligent Technology Co. Ltd.)
Xuechao Yan (Yinwang Intelligent Technology Co. Ltd.)
Siyuan Dong (Postdoc @ University of Washington; PhD @ MIT)
Lu Hou (Yinwang Intelligent Technology Co. Ltd.)
Qingqiu Huang (Yinwang Intelligent Technology Co. Ltd.)
Xiaosong Jia (Assistant Professor, Institute of Trustworthy Embodied AI (TEAI), Fudan University)
Hang Xu (Yinwang Intelligent Technology Co. Ltd.)