E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving

📅 2025-12-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing end-to-end autonomous driving systems neglect passenger affective states, hindering human-centered experience and user acceptance. To address this, we propose E3AD—a novel emotion-empowered end-to-end autonomous driving framework that pioneers deep integration of emotion perception into vision-language-action (VLA) joint decision-making. Specifically, E3AD introduces continuous Valence-Arousal-Dominance (VAD) emotion representation, a dual-path spatial reasoning module unifying egocentric and allocentric perspectives, and a cognition-inspired alignment mechanism ensuring consistency between affective intent and driving actions. Training leverages multimodal pretraining, multi-view joint modeling, and preference-aligned optimization. Evaluated on real-world datasets, E3AD achieves significant improvements in trajectory waypoint prediction and visual localization, while attaining state-of-the-art correlation in emotion estimation. Results demonstrate that affective co-reasoning delivers critical gains for human-centered autonomous driving.

📝 Abstract
End-to-end autonomous driving (AD) systems increasingly adopt vision-language-action (VLA) models, yet they typically ignore the passenger's emotional state, which is central to comfort and AD acceptance. We introduce Open-Domain End-to-End (OD-E2E) autonomous driving, where an autonomous vehicle (AV) must interpret free-form natural-language commands, infer the underlying emotion, and plan a physically feasible trajectory. We propose E3AD, an emotion-aware VLA framework that augments semantic understanding with two cognitively inspired components: a continuous Valence-Arousal-Dominance (VAD) emotion model that captures tone and urgency from language, and a dual-pathway spatial reasoning module that fuses egocentric and allocentric views for human-like spatial cognition. A consistency-oriented training scheme, combining modality pretraining with preference-based alignment, further enforces coherence between emotional intent and driving actions. Across real-world datasets, E3AD improves visual grounding and waypoint planning and achieves state-of-the-art (SOTA) VAD correlation for emotion estimation. These results show that injecting emotion into VLA-style driving yields more human-aligned grounding, planning, and human-centric feedback.
Problem

Research questions and friction points this paper is trying to address.

Incorporates passenger emotional state into autonomous driving decisions
Interprets natural-language commands and plans feasible vehicle trajectories
Enhances visual grounding and waypoint planning with emotion-aware reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continuous VAD emotion model captures tone and urgency from language
Dual-pathway spatial reasoning fuses egocentric and allocentric views
Consistency-oriented training aligns emotional intent with driving actions
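The three ideas above can be illustrated with a minimal numpy sketch. Everything here is a hypothetical stand-in, not the paper's actual implementation: the feature dimensions, the random projection weights, and the simple sum-then-tanh fusion are illustrative assumptions; the paper's dual-pathway module and waypoint head are learned networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def vad_embed(valence, arousal, dominance):
    """Pack a continuous Valence-Arousal-Dominance reading into a 3-D
    vector, clipped to the conventional [-1, 1] range."""
    return np.clip(np.array([valence, arousal, dominance], dtype=float),
                   -1.0, 1.0)

def dual_path_fuse(ego_feat, allo_feat, w_ego, w_allo):
    """Toy dual-pathway fusion: project egocentric and allocentric view
    features separately, then combine them into one spatial state."""
    return np.tanh(ego_feat @ w_ego + allo_feat @ w_allo)

# Hypothetical dimensions: 8-D per-view features fused into a 4-D state.
w_ego = rng.normal(size=(8, 4))
w_allo = rng.normal(size=(8, 4))
state = dual_path_fuse(rng.normal(size=8), rng.normal(size=8),
                       w_ego, w_allo)

# Condition a waypoint head on [fused spatial state ; VAD emotion], so
# the predicted action depends on both geometry and affect.
emotion = vad_embed(valence=-0.4, arousal=0.9, dominance=0.1)  # urgent tone
w_head = rng.normal(size=(4 + 3, 2))  # maps to a 2-D (x, y) waypoint
waypoint = np.concatenate([state, emotion]) @ w_head
print(waypoint.shape)  # (2,)
```

Concatenating the emotion vector before the action head is one simple way to make the planned waypoint a function of affective state; the consistency-oriented training the bullets describe would then penalize actions that contradict that emotional intent.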