A Vision-Language-Action Model with Visual Prompt for OFF-Road Autonomous Driving

📅 2026-01-07
🏛️ arXiv.org
🤖 AI Summary
This work proposes OFF-EMMA, an end-to-end multimodal framework addressing the inefficiency and poor adaptability of trajectory planning in off-road autonomous driving, which stem from inadequate spatial perception and unstable reasoning. OFF-EMMA enhances the spatial understanding of a pre-trained vision-language model by incorporating semantic segmentation masks as visual prompts and introduces a Chain-of-Thought Self-Consistency (CoT-SC) reasoning mechanism to integrate visual, linguistic, and action information. This enables multi-path consistency verification to suppress anomalous planning outputs. Evaluated on the RELLIS-3D dataset, OFF-EMMA reduces the average L2 error by 13.3% and significantly lowers the failure rate from 16.52% to 6.56% compared to a Qwen backbone model, demonstrating its superior performance and robustness in complex off-road environments.
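
To make the visual prompt idea concrete, here is a minimal sketch of blending a semantic segmentation mask onto an input image before it is passed to a vision-language model. The summary does not describe how OFF-EMMA renders its prompts, so the palette, the class ids, the `alpha` value, and the function name `overlay_mask` are all illustrative assumptions rather than the authors' implementation.

```python
# Hypothetical sketch: alpha-blend a per-pixel class mask onto an RGB image.
# The palette, class ids, and alpha below are assumptions for illustration,
# not values taken from the OFF-EMMA paper.

Pixel = tuple[int, int, int]

PALETTE: dict[int, Pixel] = {
    1: (0, 200, 0),      # assumed class id for a traversable surface
    2: (150, 100, 50),   # assumed class id for an obstacle-like surface
}


def overlay_mask(
    image: list[list[Pixel]],
    mask: list[list[int]],
    alpha: float = 0.5,
) -> list[list[Pixel]]:
    """Return a copy of `image` with each labeled pixel tinted by its class color.

    Pixels whose class id is not in PALETTE are left unchanged.
    """
    blended = []
    for image_row, mask_row in zip(image, mask):
        row = []
        for pixel, class_id in zip(image_row, mask_row):
            color = PALETTE.get(class_id)
            if color is None:
                row.append(pixel)
            else:
                row.append(
                    tuple(
                        round((1 - alpha) * p + alpha * c)
                        for p, c in zip(pixel, color)
                    )
                )
        blended.append(row)
    return blended
```

A nested-list image keeps the sketch free of dependencies; in practice the same blend would normally be done on arrays with an imaging library.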

📝 Abstract
Efficient trajectory planning in off-road terrains presents a formidable challenge for autonomous vehicles, often necessitating complex multi-step pipelines. However, traditional approaches exhibit limited adaptability in dynamic environments. To address these limitations, this paper proposes OFF-EMMA, a novel end-to-end multimodal framework designed to overcome the deficiencies of insufficient spatial perception and unstable reasoning in visual-language-action (VLA) models for off-road autonomous driving scenarios. The framework explicitly annotates input images through the design of a visual prompt block and introduces a chain-of-thought with self-consistency (COT-SC) reasoning strategy to enhance the accuracy and robustness of trajectory planning. The visual prompt block utilizes semantic segmentation masks as visual prompts, enhancing the spatial understanding ability of pre-trained visual-language models for complex terrains. The COT-SC strategy effectively mitigates the error impact of outliers on planning performance through a multi-path reasoning mechanism. Experimental results on the RELLIS-3D off-road dataset demonstrate that OFF-EMMA significantly outperforms existing methods, reducing the average L2 error of the Qwen backbone model by 13.3% and decreasing the failure rate from 16.52% to 6.56%.
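
The abstract does not spell out how the multi-path reasoning results are combined or how the L2 error is computed, so the following is only a sketch under stated assumptions. It treats each reasoning path as one candidate trajectory, builds a pointwise-median consensus, discards candidates that lie far from it, and averages the rest. The names `l2_error` and `aggregate_self_consistent`, the median-based rule, and the `max_dev` threshold are hypothetical choices, not the authors' method.

```python
import math
from statistics import median

Point = tuple[float, float]
Trajectory = list[Point]


def l2_error(pred: Trajectory, target: Trajectory) -> float:
    """Mean Euclidean distance between corresponding waypoints."""
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(target)


def aggregate_self_consistent(
    candidates: list[Trajectory],
    max_dev: float = 1.0,  # assumed outlier threshold, in the trajectory's units
) -> Trajectory:
    """Combine sampled trajectories while suppressing outliers.

    Assumed rule (not from the paper): take the pointwise median as the
    consensus, keep candidates whose mean L2 distance to it is at most
    `max_dev`, and return the pointwise mean of those inliers.
    """
    n_points = len(candidates[0])
    consensus = [
        (
            median(c[i][0] for c in candidates),
            median(c[i][1] for c in candidates),
        )
        for i in range(n_points)
    ]
    inliers = [c for c in candidates if l2_error(c, consensus) <= max_dev]
    if not inliers:
        # Fall back to the consensus itself if every candidate was rejected.
        return consensus
    return [
        (
            sum(c[i][0] for c in inliers) / len(inliers),
            sum(c[i][1] for c in inliers) / len(inliers),
        )
        for i in range(n_points)
    ]
```

With several sampled paths that mostly agree and one that diverges sharply, the divergent path falls outside `max_dev` and does not affect the returned plan, which is the kind of outlier suppression the abstract attributes to the self-consistency strategy.
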
Problem

Research questions and friction points this paper is trying to address.

off-road autonomous driving
trajectory planning
visual-language-action model
spatial perception
dynamic environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual prompt
chain-of-thought reasoning
self-consistency
vision-language-action model
off-road autonomous driving
Liangdong Zhang
Defense Innovation Institute, Academy of Military Sciences, Beijing 100071, China
Yiming Nie
Defense Innovation Institute, Academy of Military Sciences, Beijing 100071, China
Haoyang Li
Information Engineering College, Nanchang University, Nanchang 330031, China
Fanjie Kong
Duke University
Machine Learning, Computer Vision, Fairness, Medical Image Analysis
Baobao Zhang
Syracuse University
Political Science, Public Policy, Technology Policy
Shunxin Huang
Defense Innovation Institute, Academy of Military Sciences, Beijing 100071, China
Kai Fu
The University of Utah
GaN, Ga2O3, WBG/UWBG, Power Electronics
Chen Min
Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
Liang Xiao
Defense Innovation Institute, Academy of Military Sciences, Beijing 100071, China