🤖 AI Summary
To address the weak generalization and poor few-shot adaptability of end-to-end trajectory planning under adverse weather, complex road networks, and uncertain human behavior, this paper proposes the first driving assistant framework integrating vision-language understanding with explicit symbolic reasoning. Our core contribution is Trajectory Preference Optimization (TPO), a novel method that couples chain-of-thought reasoning with semantic motion prediction to enable interpretable, few-shot-generalizable planning driven by a vision-language model (VLM). TPO employs a two-stage training paradigm, supervised fine-tuning followed by preference alignment optimization, to jointly model multimodal perception, object motion regression, and logic-constrained reasoning. Evaluated on nuScenes, our approach achieves a mean L2 trajectory error of 0.31 m and a collision rate of 0.10%, significantly outperforming state-of-the-art end-to-end, VLM-based, and LLM-based baselines.
📝 Abstract
Trajectory planning is a fundamental yet challenging component of autonomous driving. End-to-end planners frequently falter under adverse weather, unpredictable human behavior, or complex road layouts, primarily because they lack strong generalization or few-shot capabilities beyond their training data. We propose LLaViDA, a Large Language Vision Driving Assistant that leverages a Vision-Language Model (VLM) for object motion prediction, semantic grounding, and chain-of-thought reasoning for trajectory planning in autonomous driving. A two-stage training pipeline, supervised fine-tuning followed by Trajectory Preference Optimization (TPO), enhances scene understanding and trajectory planning by injecting regression-based supervision, producing a powerful "VLM Trajectory Planner for Autonomous Driving." On the nuScenes benchmark, LLaViDA surpasses state-of-the-art end-to-end and other recent VLM/LLM-based baselines on the open-loop trajectory planning task, achieving an average L2 trajectory error of 0.31 m and a collision rate of 0.10% on the nuScenes test set. The code for this paper is available on GitHub.
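To make the second training stage concrete, here is a minimal sketch of what a pairwise trajectory-preference objective could look like, assuming TPO follows a DPO-style loss over a preferred and a rejected candidate trajectory, scored under both the trained policy and the frozen supervised-fine-tuned reference model. The function name, arguments, and the `beta` temperature are illustrative assumptions, not the paper's actual implementation.

```python
import math

def tpo_preference_loss(logp_preferred, logp_rejected,
                        ref_logp_preferred, ref_logp_rejected,
                        beta=0.1):
    """Hypothetical DPO-style pairwise preference loss over trajectories.

    logp_preferred / logp_rejected: summed token log-probabilities of the
    preferred and rejected candidate trajectories under the policy being
    trained. ref_logp_*: the same quantities under the frozen reference
    model from the supervised fine-tuning stage.
    """
    # Reward margin: how much more the policy (relative to the reference)
    # prefers the chosen trajectory over the rejected one.
    margin = beta * ((logp_preferred - ref_logp_preferred)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin: loss shrinks as the policy
    # separates the preferred trajectory from the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With a zero margin the loss equals ln 2, and it decreases monotonically as the policy assigns relatively higher likelihood to the preferred trajectory, which is the basic behavior any preference-alignment stage of this kind relies on.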