HERMES: A Holistic End-to-End Risk-Aware Multimodal Embodied System with Vision-Language Models for Long-Tail Autonomous Driving

šŸ“… 2026-02-01
šŸ“ˆ Citations: 0
✨ Influential: 0
šŸ¤– AI Summary
This work addresses insufficient risk awareness in end-to-end autonomous driving under long-tail mixed-traffic scenarios, where balancing safety and accuracy remains difficult. The authors propose a holistic, risk-aware multimodal driving framework that uses structured annotation to construct long-tail scene context and planning context. The framework features a novel tri-modal fusion architecture combining multi-view perception, historical motion, and semantic guidance, and, for the first time, explicitly incorporates risk-centric cues, maneuver intent, and safety preferences into end-to-end trajectory planning. Leveraging a vision-language-model-based annotation pipeline, the proposed approach significantly outperforms existing end-to-end and VLM-driven methods on a real-world long-tail dataset. Ablation studies further confirm the effectiveness and complementarity of the individual components.
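The summary refers to structured Long-Tail Scene Context and Long-Tail Planning Context, but the page does not give the annotation schema. As a purely hypothetical illustration of what such VLM-produced structured annotation might look like, the sketch below shows one plausible shape; every field name and value here is an assumption, not the paper's format:

```python
# Hypothetical example of structured annotation a VLM pipeline might emit.
# Field names and values are illustrative assumptions, not HERMES's schema.
long_tail_scene_context = {
    "scenario": "unprotected left turn in mixed traffic",
    "hazards": [  # hazard-centric cues
        {"agent": "cyclist", "position": "front-right", "risk": "high"},
        {"agent": "oncoming car", "position": "front", "risk": "medium"},
    ],
}
long_tail_planning_context = {
    "maneuver_intent": "yield, then turn left",  # operational intent
    "safety_preference": "conservative",         # safety preference cue
}
```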

šŸ“ Abstract
End-to-end autonomous driving models increasingly benefit from large vision-language models for semantic understanding, yet ensuring safe and accurate operation under long-tail conditions remains challenging. These challenges are particularly prominent in long-tail mixed-traffic scenarios, where autonomous vehicles must interact with heterogeneous road users, including human-driven vehicles and vulnerable road users, under complex and uncertain conditions. This paper proposes HERMES, a holistic risk-aware end-to-end multimodal driving framework designed to inject explicit long-tail risk cues into trajectory planning. HERMES employs a foundation-model-assisted annotation pipeline to produce structured Long-Tail Scene Context and Long-Tail Planning Context, capturing hazard-centric cues together with maneuver intent and safety preference, and uses these signals to guide end-to-end planning. HERMES further introduces a Tri-Modal Driving Module that fuses multi-view perception, historical motion cues, and semantic guidance, enabling accurate, risk-aware trajectory planning in long-tail scenarios. Experiments on a real-world long-tail dataset demonstrate that HERMES consistently outperforms representative end-to-end and VLM-driven baselines under long-tail mixed-traffic scenarios. Ablation studies verify the complementary contributions of key components.
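The abstract does not specify how the Tri-Modal Driving Module fuses its inputs. The PyTorch sketch below is a minimal illustration of one way to fuse the three modalities with a single cross-attention planning query; the class and module names, feature widths, and the attention-based fusion are all assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

class TriModalFusionSketch(nn.Module):
    """Hypothetical tri-modal fusion planner: fuses multi-view perception
    features, historical ego motion, and a semantic guidance embedding,
    then regresses future waypoints."""

    def __init__(self, d_model=256, hist_len=8, horizon=6):
        super().__init__()
        # Per-view image features assumed pre-extracted (e.g. by a frozen
        # backbone) and projected to a shared width.
        self.view_proj = nn.Linear(512, d_model)
        # Historical motion: (x, y, heading) per past step, encoded by a GRU.
        self.motion_enc = nn.GRU(3, d_model, batch_first=True)
        # Semantic guidance: a pooled text embedding, e.g. from a VLM.
        self.text_proj = nn.Linear(768, d_model)
        # A learned planning query attends over all modality tokens.
        self.query = nn.Parameter(torch.randn(1, 1, d_model))
        self.attn = nn.MultiheadAttention(d_model, 8, batch_first=True)
        # Trajectory head: (x, y) waypoints over the planning horizon.
        self.head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                  nn.Linear(d_model, horizon * 2))
        self.horizon = horizon

    def forward(self, view_feats, motion_hist, text_emb):
        # view_feats: (B, n_views, 512); motion_hist: (B, hist_len, 3);
        # text_emb: (B, 768)
        v = self.view_proj(view_feats)              # (B, n_views, D)
        _, h = self.motion_enc(motion_hist)         # h: (1, B, D)
        m = h.transpose(0, 1)                       # (B, 1, D)
        t = self.text_proj(text_emb).unsqueeze(1)   # (B, 1, D)
        tokens = torch.cat([v, m, t], dim=1)        # (B, n_views + 2, D)
        q = self.query.expand(tokens.size(0), -1, -1)
        fused, _ = self.attn(q, tokens, tokens)     # (B, 1, D)
        traj = self.head(fused.squeeze(1))          # (B, horizon * 2)
        return traj.view(-1, self.horizon, 2)

if __name__ == "__main__":
    model = TriModalFusionSketch()
    traj = model(torch.randn(2, 6, 512),   # 6 camera views
                 torch.randn(2, 8, 3),     # 8 past motion steps
                 torch.randn(2, 768))      # semantic guidance embedding
    print(traj.shape)  # torch.Size([2, 6, 2])
```

In this sketch, risk cues would enter through the semantic guidance embedding; the actual HERMES design may condition planning on them differently.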
Problem

Research questions and friction points this paper is trying to address.

long-tail scenarios
autonomous driving
mixed-traffic
risk-aware
heterogeneous road users
Innovation

Methods, ideas, or system contributions that make the work stand out.

long-tail driving
risk-aware planning
vision-language models
multimodal fusion
end-to-end autonomous driving