MTR-VP: Towards End-to-End Trajectory Planning through Context-Driven Image Encoding and Multiple Trajectory Prediction

šŸ“… 2025-11-27
šŸ“ˆ Citations: 0
✨ Influential: 0
šŸ¤– AI Summary
This work addresses end-to-end autonomous driving trajectory planning's reliance on high-definition (HD) maps. We propose MTR-VP, a map-free framework that encodes visual inputs and historical motion states with a Vision Transformer (ViT) to produce visual context embeddings; a cross-attention mechanism then fuses planning intent with these embeddings for multimodal trajectory prediction. The key contribution lies in eliminating explicit map modeling and instead learning intent-guided visual representations, improving the robustness of scene understanding under map-free conditions. Experiments demonstrate that MTR-VP's multimodal trajectory distribution prediction significantly outperforms both deterministic single-trajectory baselines and existing feature-fusion approaches. On the Waymo End-to-End Driving Dataset, even strong pretrained visual encoders (e.g., CLIP or DINOv2) fail to compensate for the absence of HD maps when fused only with motion features; in contrast, MTR-VP achieves superior planning performance by jointly modeling intent and visual-motion context.

šŸ“ Abstract
We present a method for trajectory planning in autonomous driving that learns image-based context embeddings aligned with motion prediction frameworks and planning-based intention input. A ViT encoder takes raw images and the past kinematic state as input and is trained to produce context embeddings, inspired by those generated by the recent MTR (Motion Transformer) encoder, effectively substituting learned visual representations for map-based features. MTR provides a strong foundation for multimodal trajectory prediction by localizing agent intent and iteratively refining motion via motion query pairs. We name our approach MTR-VP (Motion Transformer for Vision-based Planning); instead of the learnable intention queries used in the MTR decoder, we apply cross attention between the intent and the context embeddings, which combine information encoded from the driving scene and past vehicle states. We evaluate our method on the Waymo End-to-End Driving Dataset, which requires predicting the agent's future 5-second trajectory in bird's-eye-view coordinates from prior camera images, agent pose history, and routing goals. We analyze our architecture with ablation studies that remove the input images and the multiple-trajectory output. Our results suggest that transformer-based methods for combining visual features with kinematic features, such as past trajectory features, are not effective at fusing the two modalities into useful scene context embeddings, even when intention embeddings are augmented with foundation-model representations of scene context from CLIP and DINOv2, but that predicting a distribution over multiple futures instead of a single future trajectory boosts planning performance.
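The fusion step the abstract describes, a cross-attention in which a planning-intent query attends to visual-motion context tokens, followed by per-mode heads that regress and score K candidate trajectories, can be sketched as below. This is a minimal NumPy illustration, not the paper's implementation: the dimensions, the random linear heads, and all variable names are assumptions chosen for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context, d_k):
    """Single-head cross attention: the intent query attends over context tokens.

    query:   (1, d_k) planning-intent embedding
    context: (n, d_k) visual-motion context embeddings (e.g., ViT tokens)
    """
    scores = query @ context.T / np.sqrt(d_k)  # (1, n) scaled dot products
    weights = softmax(scores, axis=-1)         # attention over context tokens
    return weights @ context                   # (1, d_k) fused embedding

rng = np.random.default_rng(0)
d = 64         # embedding dimension (assumed)
n_tokens = 16  # number of context tokens from the encoder (assumed)
K = 6          # number of predicted trajectory modes (assumed)
T = 10         # future waypoints over the 5-second horizon (assumed)

intent = rng.standard_normal((1, d))           # routing-goal / intent embedding
context = rng.standard_normal((n_tokens, d))   # scene + past-state context

fused = cross_attention(intent, context, d)    # (1, d)

# Per-mode heads (stand-ins for learned layers): each mode regresses
# T (x, y) waypoints in bird's-eye view and a confidence logit.
W_traj = rng.standard_normal((K, d, T * 2)) * 0.01
W_logit = rng.standard_normal((K, d)) * 0.01

trajectories = np.stack([(fused @ W_traj[k]).reshape(T, 2) for k in range(K)])
mode_probs = softmax(fused @ W_logit.T, axis=-1)[0]  # distribution over modes

print(trajectories.shape)  # (6, 10, 2): K modes of T bird's-eye-view waypoints
```

Predicting all K modes and a probability for each is what distinguishes the multi-trajectory output from the deterministic single-trajectory baseline ablated in the paper.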
Problem

Research questions and friction points this paper is trying to address.

Develops vision-based trajectory planning for autonomous driving
Replaces map features with learned image context embeddings
Evaluates multi-trajectory prediction to enhance planning performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Transformer encodes images and kinematics for context
Cross attention replaces learnable queries with intent embeddings
Multiple trajectory prediction improves planning over single futures