🤖 AI Summary
Autonomous driving systems exhibit insufficient robustness in rare, ambiguous, and out-of-distribution scenarios, whereas humans effectively reason using contextual cues and commonsense knowledge. Existing shared autonomy approaches operate primarily at the low-level trajectory arbitration layer, failing to preserve high-level driver intent. To address this, we propose the first semantic-level unified shared autonomy framework: it elevates arbitration to a vision-language model (VLM)-driven high-level intention representation space, integrating multimodal cues—including driver behavior and environmental perception—for joint human–machine intention inference and policy fusion. Experiments demonstrate 100% arbitration recall and high precision in simulation; 92% of participants rated its decisions as reasonable; and on the Bench2Drive benchmark, it significantly reduces collision rates compared to fully autonomous baselines. This work establishes a paradigm shift from trajectory-level to intention-level shared control, enabling more interpretable, robust, and human-aligned autonomous driving.
📝 Abstract
Autonomous driving systems remain brittle in rare, ambiguous, and out-of-distribution scenarios, where human drivers succeed through contextual reasoning. Shared autonomy has emerged as a promising approach to mitigate such failures by incorporating human input when autonomy is uncertain. However, most existing methods restrict arbitration to low-level trajectories, which represent only geometric paths and therefore fail to preserve the underlying driving intent. We propose a unified shared autonomy framework that integrates human input and autonomous planners at a higher level of abstraction. Our method leverages Vision Language Models (VLMs) to infer driver intent from multi-modal cues, such as driver actions and environmental context, and to synthesize coherent strategies that mediate between human and autonomous control. We first study the framework in a mock-human setting, where it achieves perfect recall alongside high accuracy and precision. A human-subject survey further shows strong alignment, with participants agreeing with arbitration outcomes in 92% of cases. Finally, evaluation on the Bench2Drive benchmark demonstrates a substantial reduction in collision rate and improvement in overall performance compared to pure autonomy. Arbitration at the level of semantic, language-based representations emerges as a design principle for shared autonomy, enabling systems to exercise common-sense reasoning and maintain continuity with human intent.
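To make the contrast with trajectory-level blending concrete, here is a minimal, hypothetical sketch of intention-level arbitration. It is not the paper's implementation: the `Intent` type, the confidence threshold, and the `arbitrate` function are illustrative assumptions, and the VLM inference step is replaced by pre-computed intent labels. The point is only that the arbiter compares semantic intents and commits to one coherent strategy rather than averaging geometric paths.

```python
# Hypothetical sketch of intention-level arbitration (not the paper's method).
# A real system would query a VLM with driver actions and camera context to
# produce these Intent values; here they are supplied directly.

from dataclasses import dataclass


@dataclass
class Intent:
    label: str         # semantic intent, e.g. "yield_to_pedestrian"
    confidence: float  # inferred confidence in [0, 1]


def arbitrate(human: Intent, autonomy: Intent, threshold: float = 0.6) -> str:
    """Return the single intent label that governs control."""
    if human.label == autonomy.label:
        return human.label       # agreement: no arbitration needed
    if human.confidence >= threshold:
        return human.label       # defer to a confidently inferred human intent
    return autonomy.label        # otherwise keep the autonomous plan


# Example: autonomy plans to proceed, but the driver brakes for a pedestrian.
result = arbitrate(Intent("yield_to_pedestrian", 0.9),
                   Intent("proceed_through_intersection", 0.7))
print(result)  # → yield_to_pedestrian
```

Because the output is a single semantic intent rather than a blended trajectory, the downstream planner can execute one internally consistent strategy, which is the continuity-of-intent property the abstract emphasizes.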