🤖 AI Summary
To address the weak generalization and poor few-shot adaptability of end-to-end trajectory planning under adverse weather, complex road networks, and uncertain human behavior, this paper proposes the first driving assistant framework integrating vision-language understanding with explicit symbolic reasoning. Our core contribution is Trajectory Preference Optimization (TPO), a novel method that couples chain-of-thought reasoning with semantic motion prediction to enable interpretable, few-shot-generalizable planning driven by a vision-language model (VLM). TPO employs a two-stage training paradigm, supervised fine-tuning followed by preference alignment optimization, to jointly model multimodal perception, object motion regression, and logic-constrained reasoning. Evaluated on nuScenes, our approach achieves a mean L2 trajectory error of 0.31 m and a collision rate of 0.10%, significantly outperforming state-of-the-art end-to-end, VLM-based, and LLM-based baselines.
📝 Abstract
Trajectory planning is a fundamental yet challenging component of autonomous driving. End-to-end planners frequently falter under adverse weather, unpredictable human behavior, or complex road layouts, primarily because they lack strong generalization or few-shot capabilities beyond their training data. We propose LLaViDA, a Large Language Vision Driving Assistant that leverages a Vision-Language Model (VLM) for object motion prediction, semantic grounding, and chain-of-thought reasoning for trajectory planning in autonomous driving. A two-stage training pipeline, supervised fine-tuning followed by Trajectory Preference Optimization (TPO), enhances scene understanding and trajectory planning by injecting regression-based supervision, producing a powerful "VLM Trajectory Planner for Autonomous Driving." On the nuScenes benchmark, LLaViDA surpasses state-of-the-art end-to-end and other recent VLM/LLM-based baselines on the open-loop trajectory planning task, achieving an average L2 trajectory error of 0.31 m and a collision rate of 0.10% on the nuScenes test set. The code for this paper is available on GitHub.
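To make the second training stage concrete, here is a minimal sketch of what a pairwise trajectory-preference objective could look like, assuming TPO follows a DPO-style loss over a preferred and a rejected candidate trajectory, scored under both the trained policy and the frozen supervised-fine-tuned reference model. The function name, arguments, and the `beta` temperature are illustrative assumptions, not the paper's actual implementation.

```python
import math

def tpo_preference_loss(logp_preferred, logp_rejected,
                        ref_logp_preferred, ref_logp_rejected,
                        beta=0.1):
    """Hypothetical DPO-style pairwise preference loss over trajectories.

    logp_preferred / logp_rejected: summed token log-probabilities of the
    preferred and rejected candidate trajectories under the policy being
    trained. ref_logp_*: the same quantities under the frozen reference
    model from the supervised fine-tuning stage.
    """
    # Reward margin: how much more the policy (relative to the reference)
    # prefers the chosen trajectory over the rejected one.
    margin = beta * ((logp_preferred - ref_logp_preferred)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin: loss shrinks as the policy
    # separates the preferred trajectory from the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With a zero margin the loss equals ln 2, and it decreases monotonically as the policy assigns relatively higher likelihood to the preferred trajectory, which is the basic behavior any preference-alignment stage of this kind relies on.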