TrajDiff: End-to-end Autonomous Driving without Perception Annotation

📅 2025-11-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the bottleneck of reliance on costly manual perception annotations in end-to-end autonomous driving, this paper proposes TrajDiff, the first fully annotation-free generative driving framework. Methodologically, it introduces a trajectory-oriented BEV diffusion paradigm comprising an unsupervised TrajBEV encoder and a Trajectory-oriented BEV Diffusion Transformer (TB-DiT). Instead of using perception labels or handcrafted motion priors, TrajDiff directly models the multimodal distribution of future ego-vehicle trajectories conditioned solely on raw sensor inputs and ego-state, representing driving modes as Gaussian BEV heatmaps. Crucially, it eliminates explicit perception modules and engineered motion assumptions. Evaluated on NAVSIM, TrajDiff achieves 87.5 PDMS, improving to 88.5 PDMS with data scaling, surpassing all prior annotation-free methods and matching state-of-the-art perception-dependent approaches.

📝 Abstract
End-to-end autonomous driving systems generate driving policies directly from raw sensor inputs. While such systems can extract effective environmental features for planning by relying on auxiliary perception tasks, the high cost of manual perception annotation makes annotation-free planning paradigms increasingly critical. In this work, we propose TrajDiff, a Trajectory-oriented BEV Conditioned Diffusion framework that establishes a fully perception annotation-free generative method for end-to-end autonomous driving. TrajDiff requires only raw sensor inputs and future trajectories, from which it constructs Gaussian BEV heatmap targets that inherently capture driving modalities. We design a simple yet effective trajectory-oriented BEV encoder that extracts TrajBEV features without perceptual supervision. Furthermore, we introduce the Trajectory-oriented BEV Diffusion Transformer (TB-DiT), which leverages ego-state information and the predicted TrajBEV features to directly generate diverse yet plausible trajectories, eliminating the need for handcrafted motion priors. Beyond these architectural innovations, TrajDiff enables exploration of data-scaling benefits in the annotation-free setting. Evaluated on the NAVSIM benchmark, TrajDiff achieves 87.5 PDMS, establishing state-of-the-art performance among annotation-free methods. With data scaling, it further improves to 88.5 PDMS, comparable to advanced perception-based approaches. Our code and model will be made publicly available.
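The Gaussian BEV heatmap targets mentioned in the abstract can be sketched as follows. This is a minimal illustration only, not the paper's implementation: the grid size, BEV range, and `sigma` are assumed values, and the paper may rasterize waypoints differently.

```python
import numpy as np

def gaussian_bev_heatmap(trajectory, grid_size=128, bev_range=32.0, sigma=1.0):
    """Render future ego waypoints as a Gaussian heatmap on a BEV grid.

    trajectory: (T, 2) array of (x, y) waypoints in ego-centric meters,
    covering [-bev_range, bev_range] in both axes. The heatmap is the
    per-cell maximum over per-waypoint Gaussians, so values lie in [0, 1].
    """
    res = (2.0 * bev_range) / grid_size  # meters per cell
    ys, xs = np.meshgrid(np.arange(grid_size), np.arange(grid_size), indexing="ij")
    # Cell centers in ego-centric coordinates.
    cx = (xs + 0.5) * res - bev_range
    cy = (ys + 0.5) * res - bev_range
    heatmap = np.zeros((grid_size, grid_size), dtype=np.float32)
    for x, y in trajectory:
        g = np.exp(-((cx - x) ** 2 + (cy - y) ** 2) / (2.0 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)
    return heatmap
```

Taking the per-cell maximum rather than the sum keeps the target bounded regardless of how many waypoints fall in one cell; a sum-based variant would instead emphasize dwell time.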
Problem

Research questions and friction points this paper is trying to address.

Develops perception annotation-free end-to-end autonomous driving
Generates diverse plausible trajectories without handcrafted motion priors
Explores data scaling benefits in annotation-free autonomous driving
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trajectory-oriented BEV encoder without perceptual supervision
Trajectory-oriented BEV Diffusion Transformer for diverse trajectory generation
Fully perception annotation-free generative method for autonomous driving
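To illustrate how a conditioned diffusion model can generate diverse trajectories without handcrafted motion priors, here is a minimal DDPM-style ancestral sampling loop over waypoints. It is a generic sketch, not TB-DiT itself: `denoise_fn` stands in for the learned transformer, `cond` for the ego-state and TrajBEV conditioning, and the linear noise schedule is an assumption.

```python
import numpy as np

def ddpm_sample_trajectory(denoise_fn, cond, horizon=8, steps=50, seed=0):
    """Sample a (horizon, 2) waypoint trajectory by ancestral DDPM denoising.

    denoise_fn(x_t, t, cond) -> predicted noise; a placeholder for a learned
    denoiser such as TB-DiT conditioned on ego-state and BEV features.
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, steps)  # assumed linear schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal((horizon, 2))   # start from pure noise
    for t in reversed(range(steps)):
        eps = denoise_fn(x, t, cond)
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:  # no noise is added at the final step
            x += np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x
```

Sampling with different seeds under the same conditioning is what yields the "diverse yet plausible" multimodal trajectories the bullets describe.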
👥 Authors
Xingtai Gui (SKL-IOTSC, CIS, University of Macau)
Jianbo Zhao (University of Science and Technology of China)
Wencheng Han (University of Macau)
Jikai Wang (University of Texas at Dallas)
Jiahao Gong (Mach Drive)
Feiyang Tan (Mach Drive)
Cheng-zhong Xu (SKL-IOTSC, CIS, University of Macau)
Jianbing Shen (Professor, University of Macau)