Drive-JEPA: Video JEPA Meets Multimodal Trajectory Distillation for End-to-End Driving

📅 2026-01-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing end-to-end autonomous driving methods typically learn from a single human trajectory per scene, so they struggle to capture the multimodality of driving behavior, which limits both scene understanding and planning. The authors propose Drive-JEPA, a framework that adapts the Video Joint-Embedding Predictive Architecture (V-JEPA) to end-to-end driving. It pretrains a Vision Transformer (ViT) encoder with self-supervision on driving videos to learn planning-oriented predictive visual representations, and pairs it with a proposal-centric planner that distills both simulator-generated and human trajectories, using a momentum-aware selection mechanism to improve policy diversity and safety. On the NAVSIM benchmark, the V-JEPA representation with a simple transformer-based decoder surpasses prior methods by 3 PDMS in the perception-free setting; the full framework reaches 93.3 PDMS on NAVSIM v1 and 87.8 EPDMS on NAVSIM v2, which the authors report as a new state of the art.

📝 Abstract

End-to-end autonomous driving increasingly leverages self-supervised video pretraining to learn transferable planning representations. However, pretraining video world models for scene understanding has so far brought only limited improvements. This limitation is compounded by the inherent ambiguity of driving: each scene typically provides only a single human trajectory, making it difficult to learn multimodal behaviors. In this work, we propose Drive-JEPA, a framework that integrates Video Joint-Embedding Predictive Architecture (V-JEPA) with multimodal trajectory distillation for end-to-end driving. First, we adapt V-JEPA for end-to-end driving, pretraining a ViT encoder on large-scale driving videos to produce predictive representations aligned with trajectory planning. Second, we introduce a proposal-centric planner that distills diverse simulator-generated trajectories alongside human trajectories, with a momentum-aware selection mechanism to promote stable and safe behavior. When evaluated on NAVSIM, the V-JEPA representation combined with a simple transformer-based decoder outperforms prior methods by 3 PDMS in the perception-free setting. The complete Drive-JEPA framework achieves 93.3 PDMS on NAVSIM v1 and 87.8 EPDMS on NAVSIM v2, setting a new state of the art.
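
For readers unfamiliar with JEPA-style pretraining, the general idea can be sketched as follows: an online encoder embeds the visible part of a clip, a predictor guesses the embedding of the masked part, and the target embedding comes from a slowly updated (exponential moving average) copy of the encoder, with the loss computed in latent space rather than pixel space. This is a toy illustration of the generic V-JEPA recipe, not the Drive-JEPA implementation; the elementwise "encoder", the function names, and all parameters are assumptions made for the example, and no gradient step is shown.

```python
# Toy sketch of a JEPA-style latent prediction step (illustrative only).
# All names and the elementwise "encoder" are hypothetical simplifications.

def encode(weights, patch):
    """Toy encoder: scale each feature of a patch by a learned weight."""
    return [w * x for w, x in zip(weights, patch)]


def l1_latent_loss(predicted, target):
    """Mean absolute error between predicted and target embeddings."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


def ema_update(target_w, online_w, decay):
    """Move target weights toward online weights: t <- decay*t + (1-decay)*o."""
    return [decay * t + (1.0 - decay) * o for t, o in zip(target_w, online_w)]


def jepa_step(online_w, target_w, pred_w, context_patch, masked_patch, decay):
    """One forward pass: predict the masked patch's embedding from context.

    Returns the latent-space loss and the EMA-updated target weights.
    In a real system the loss would be backpropagated through the online
    encoder and predictor only; the target encoder receives no gradient.
    """
    context_emb = encode(online_w, context_patch)              # visible region
    predicted = [p * c for p, c in zip(pred_w, context_emb)]   # toy predictor
    target_emb = encode(target_w, masked_patch)                # masked region
    loss = l1_latent_loss(predicted, target_emb)
    new_target_w = ema_update(target_w, online_w, decay)
    return loss, new_target_w
```

Predicting in embedding space lets the model ignore pixel-level detail that is irrelevant to downstream tasks, which is the usual motivation given for JEPA-style objectives.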

Problem

Research questions and friction points this paper is trying to address.

end-to-end driving

multimodal behavior

video pretraining

trajectory ambiguity

scene understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Video JEPA

multimodal trajectory distillation

end-to-end driving