Natural Language Instructions for Scene-Responsive Human-in-the-Loop Motion Planning in Autonomous Driving using Vision-Language-Action Models

📅 2026-02-04
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the limited generalization of existing autonomous driving planning methods, which often rely on simulated environments or fixed command vocabularies and struggle to interpret free-form natural language instructions in real-world settings. Leveraging doScenes—the first real-world dataset of aligned free-form instructions and vehicle trajectories—the authors integrate front-view images, ego-vehicle states, and passenger-style natural language commands into the OpenEMMA multimodal large language model framework to enable language-conditioned trajectory planning. Experimental results across 849 annotated scenes show that instruction guidance reduces mean average displacement error (ADE) by 98.7%, largely by preventing extreme outlier failures; even after excluding outliers, well-phrased instructions further decrease ADE by up to 5.1%, enhancing the system's ability to understand and respond to human intent.

📝 Abstract
Instruction-grounded driving, where passenger language guides trajectory planning, requires vehicles to understand intent before motion. However, most prior instruction-following planners rely on simulation or fixed command vocabularies, limiting real-world generalization. doScenes, the first real-world dataset linking free-form instructions (with referentiality) to nuScenes ground-truth motion, enables instruction-conditioned planning. In this work, we adapt OpenEMMA, an open-source MLLM-based end-to-end driving framework that ingests front-camera views and ego-state and outputs 10-step speed-curvature trajectories, to this setting, presenting a reproducible instruction-conditioned baseline on doScenes and investigating the effects of human instruction prompts on predicted driving behavior. We integrate doScenes directives as passenger-style prompts within OpenEMMA's vision-language interface, enabling linguistic conditioning before trajectory generation. Evaluated on 849 annotated scenes using ADE, we observe that instruction conditioning substantially improves robustness by preventing extreme baseline failures, yielding a 98.7% reduction in mean ADE. When such outliers are removed, instructions still influence trajectory alignment, with well-phrased prompts improving ADE by up to 5.1%. We use this analysis to discuss what makes a "good" instruction for the OpenEMMA framework. We release the evaluation prompts and scripts to establish a reproducible baseline for instruction-aware planning. GitHub: https://github.com/Mi3-Lab/doScenes-VLM-Planning
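The evaluation metric above, average displacement error (ADE), is the mean Euclidean distance between predicted and ground-truth waypoints over the trajectory horizon. A minimal sketch (not the paper's released evaluation script; function name and toy trajectories are illustrative):

```python
import numpy as np

def average_displacement_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth
    waypoints over all steps (the paper uses a 10-step horizon)."""
    pred = np.asarray(pred, dtype=float)  # shape (T, 2): x, y per step
    gt = np.asarray(gt, dtype=float)      # shape (T, 2)
    return float(np.linalg.norm(pred - gt, axis=1).mean())

# Toy example: a prediction offset laterally by 1 m at every step
gt = np.stack([np.arange(10, dtype=float), np.zeros(10)], axis=1)
pred = gt + np.array([0.0, 1.0])
print(average_displacement_error(pred, gt))  # 1.0
```

A constant 1 m lateral offset yields an ADE of exactly 1.0, since every per-step displacement is 1 m.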
Problem

Research questions and friction points this paper is trying to address.

instruction-grounded driving
human-in-the-loop
vision-language-action models
autonomous driving
motion planning
Innovation

Methods, ideas, or system contributions that make the work stand out.

instruction-grounded driving
vision-language-action models
doScenes dataset
natural language instructions
trajectory planning
Angel Martinez-Sanchez
University of California, Merced
Parthib Roy
University of California, Merced
Ross Greer
University of California, Merced
Artificial Intelligence · Machine Vision · Autonomous Driving · Human-Robot Interaction · Computer Music