VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving

📅 2026-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of imitation learning for autonomous driving: the standard imitation objective may not fully capture the nuances of human driving preferences. It uses a vision-language model (VLM) as a zero-shot reasoner to automatically generate preference pairs from a pretrained motion forecasting model's rollouts. These VLM-generated preferences are then used to finetune the model via Direct Preference Optimization (DPO), so the finetuning stage needs no manually labeled preference pairs. On the Waymo Open End-to-End Driving Dataset (WOD-E2E), evaluated against held-out human preference annotations, the approach improves the rater feedback score (RFS) by 11.94% and reduces the average displacement error (ADE) by 10.01% relative to the pretrained model.
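For readers unfamiliar with the ADE metric, the following is a minimal sketch of how average displacement error and a relative change such as the one reported above are conventionally computed. The function names and the toy waypoints are illustrative assumptions, not taken from the paper, and the benchmark's official metric may differ in detail (for example, in which time horizons are averaged).

```python
import math

def average_displacement_error(predicted, ground_truth):
    """Mean Euclidean distance between matching waypoints of two trajectories.

    Both arguments are equal-length sequences of (x, y) points.
    This is the textbook definition; it is an assumption that the
    benchmark uses exactly this form.
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("trajectories must be non-empty and equal length")
    total = sum(math.dist(p, g) for p, g in zip(predicted, ground_truth))
    return total / len(predicted)

def relative_change(baseline, new):
    """Signed relative change of `new` with respect to `baseline`."""
    return (new - baseline) / baseline

# Toy waypoints (hypothetical, not data from the paper).
truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
pred = [(0.0, 0.0), (1.0, 3.0), (2.0, 0.0)]
ade = average_displacement_error(pred, truth)
```

A negative `relative_change` on ADE corresponds to a reduction in error, which is the direction of improvement for this metric.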
📝 Abstract
The rapid growth of autonomous driving datasets has enabled the scaling of powerful motion forecasting models. While large-scale pretraining provides strong performance, the standard imitation objective may not fully capture the complex nuances of human driving preferences. Meanwhile, recent advances in vision-language models (VLMs) have demonstrated impressive reasoning and commonsense understanding. Building on these capabilities, this paper presents VL-DPO, a vision-language-guided framework that aligns ego-vehicle motion forecasting models with human preferences. Our approach leverages a VLM as a zero-shot reasoner to automatically generate preference pairs from a pretrained model's rollouts, which are then used to finetune the model via Direct Preference Optimization (DPO). We finetune our models on the Waymo Open End-to-End Driving Dataset (WOD-E2E) and evaluate performance against held-out human preference annotations using rater feedback score (RFS) and average displacement error (ADE). Our experiments confirm that the VLM's trajectory selection is a high-quality proxy for human preference. Our final model, VL-DPO, yields an 11.94% increase in RFS and a 10.01% reduction in ADE over the pretrained model.
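The DPO finetuning step described in the abstract can be illustrated with the standard DPO loss for a single preference pair. This is a sketch under stated assumptions: it uses the generic DPO objective with trajectory log-probabilities, where the frozen pretrained model serves as the reference, and the names `dpo_loss` and `beta` are illustrative. The abstract does not specify how VL-DPO computes trajectory likelihoods or what temperature it uses.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are log-probabilities of each trajectory under the model being
    finetuned (policy) and under the frozen reference model. `beta` is a
    hypothetical default, not a value from the paper.
    """
    # How much more the policy favors the chosen trajectory than the
    # reference does, relative to the rejected one.
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    z = beta * margin
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# When the policy equals the reference, the loss is log(2).
loss_at_init = dpo_loss(-5.0, -6.0, -5.0, -6.0)

# Raising the chosen trajectory's likelihood lowers the loss.
loss_after = dpo_loss(-4.0, -6.0, -5.0, -6.0)
```

In this setup, the "chosen" trajectory would be the rollout the VLM selected and the "rejected" one a rollout it did not select; minimizing the loss pushes the model toward the VLM-preferred behavior while the reference terms keep it anchored to the pretrained model.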
Problem

Research questions and friction points this paper is trying to address.

autonomous driving
human preference alignment
motion forecasting
vision-language models
preference optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Model
Direct Preference Optimization
Autonomous Driving
Preference Alignment
Motion Forecasting