🤖 AI Summary
This work addresses the weak alignment between language annotations and vehicle actions or terrain geometry in existing off-road driving datasets, which limits the ability of vision-language models to generate accurate 3D trajectories. To overcome this, the authors propose an action-aligned language refinement framework combined with a terrain-aware preference optimization strategy built on geometry-aware hard negatives, enabling direct generation of geometrically consistent 3D trajectories from a single image. The study also introduces two off-road-specific evaluation metrics, traversability compliance and elevation contour consistency, to better assess trajectory quality. Evaluated on the ORAD-3D benchmark, the method reduces the average trajectory error from 1.01 m to 0.97 m, improves traversability compliance from 0.621 to 0.644, and lowers elevation inconsistency from 0.428 to 0.322, indicating improved geometric fidelity and terrain awareness.
📝 Abstract
While Vision-Language Models (VLMs) enable high-level semantic reasoning for end-to-end autonomous driving, particularly in unstructured environments, existing off-road datasets suffer from language annotations that are weakly aligned with vehicle actions and terrain geometry. To address this misalignment, we propose a language refinement framework that restructures annotations into action-aligned pairs, enabling a VLM to generate refined scene descriptions and 3D future trajectories directly from a single image. To further encourage terrain-aware planning, we introduce a preference optimization strategy that constructs geometry-aware hard negatives and explicitly penalizes trajectories inconsistent with local elevation profiles. Furthermore, we propose off-road-specific metrics to quantify traversability compliance and elevation consistency, addressing the limitations of conventional on-road evaluation. Experiments on the ORAD-3D benchmark demonstrate that our approach reduces average trajectory error from 1.01 m to 0.97 m, improves traversability compliance from 0.621 to 0.644, and decreases elevation inconsistency from 0.428 to 0.322, highlighting the efficacy of action-aligned supervision and terrain-aware optimization for robust off-road driving.
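To put the three reported figures on a common scale, the short sketch below computes the relative change for each metric from the numbers quoted in the abstract. The helper `relative_change` and the dictionary keys are illustrative names, not taken from the paper, and the snippet only does arithmetic on the published values; it does not reproduce how the metrics themselves are computed.

```python
def relative_change(before: float, after: float) -> float:
    """Signed relative change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before


# (baseline, proposed) pairs exactly as quoted in the abstract.
reported = {
    "average trajectory error (m)": (1.01, 0.97),   # lower is better
    "traversability compliance": (0.621, 0.644),    # higher is better
    "elevation inconsistency": (0.428, 0.322),      # lower is better
}

changes = {
    name: relative_change(before, after)
    for name, (before, after) in reported.items()
}

for name, change in changes.items():
    # "+" / "-" sign shows the direction of the change relative to the baseline.
    print(f"{name}: {change:+.1%}")
```

Read this way, the trajectory error and traversability compliance each move by roughly 4% relative to the baseline, while elevation inconsistency drops by roughly a quarter, so the largest reported gain is on the elevation-related metric.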