🤖 AI Summary
This work addresses the limitations of autonomous driving systems in safety assessment and planning under complex and out-of-distribution scenarios by proposing a system-level approach grounded in vision–language joint representations. The method combines lightweight, class-agnostic hazard screening, task-aligned vision–language embedding extraction, and natural-language, instruction-guided behavior-constrained planning to enable semantic risk representation and safe decision-making. Fusing CLIP-based image–text similarity, scene-level multimodal embeddings, and a Transformer-based trajectory planner, the approach achieves low-latency, class-agnostic hazard detection on the Waymo Open Dataset and doScenes benchmarks, significantly reduces severe planning failures, and improves behavioral consistency and safety in ambiguous driving situations, while also showing that naively conditioning planners on global scene embeddings does not improve trajectory accuracy.
📝 Abstract
Vision-language models (VLMs) have recently emerged as powerful representation learning systems that align visual observations with natural language concepts, offering new opportunities for semantic reasoning in safety-critical autonomous driving. This paper investigates how vision-language representations support driving scene safety assessment and decision-making when integrated into perception, prediction, and planning pipelines. We study three complementary system-level use cases. First, we introduce a lightweight, category-agnostic hazard screening approach leveraging CLIP-based image-text similarity to produce a low-latency semantic hazard signal. This enables robust detection of diverse and out-of-distribution road hazards without explicit object detection or visual question answering. Second, we examine the integration of scene-level vision-language embeddings into a transformer-based trajectory planning framework using the Waymo Open Dataset. Our results show that naively conditioning planners on global embeddings does not improve trajectory accuracy, highlighting the importance of representation-task alignment and motivating the development of task-informed extraction methods for safety-critical planning. Third, we investigate natural language as an explicit behavioral constraint on motion planning using the doScenes dataset. In this setting, passenger-style instructions grounded in visual scene elements suppress rare but severe planning failures and improve safety-aligned behavior in ambiguous scenarios. Taken together, these findings demonstrate that vision-language representations hold significant promise for autonomous driving safety when used to express semantic risk, intent, and behavioral constraints. Realizing this potential is fundamentally an engineering problem requiring careful system design and structured grounding rather than direct feature injection.
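The first use case, category-agnostic hazard screening via CLIP-based image–text similarity, can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the prompt banks, temperature, and embedding dimensionality are placeholder assumptions, and in a real system the embeddings would come from CLIP's image and text encoders rather than the synthetic vectors used here.

```python
import numpy as np

def hazard_score(image_emb: np.ndarray,
                 hazard_text_embs: np.ndarray,
                 benign_text_embs: np.ndarray,
                 temperature: float = 0.07) -> float:
    """Return a semantic hazard probability in [0, 1].

    Compares a CLIP-style image embedding against banks of
    hazard-describing and benign-scene text-prompt embeddings,
    taking the maximum cosine similarity in each bank and a
    softmax over the two scores. No object detector or VQA
    model is involved, so the signal is category-agnostic.
    """
    def _norm(x: np.ndarray) -> np.ndarray:
        # L2-normalize so dot products become cosine similarities.
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    img = _norm(image_emb)
    s_hazard = float(np.max(_norm(hazard_text_embs) @ img))
    s_benign = float(np.max(_norm(benign_text_embs) @ img))

    # Temperature-scaled softmax over the two bank scores;
    # 0.07 mirrors CLIP's learned logit scale (an assumption here).
    logits = np.array([s_hazard, s_benign]) / temperature
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(probs[0])

# Synthetic stand-ins for encoder outputs (3 hazard prompts such as
# "debris on the road", 2 benign prompts such as "clear highway").
rng = np.random.default_rng(0)
hazard_bank = rng.normal(size=(3, 64))
benign_bank = rng.normal(size=(2, 64))
score = hazard_score(hazard_bank[0], hazard_bank, benign_bank)
```

Because the score depends only on similarity to open-vocabulary prompts, novel or out-of-distribution hazards can still raise it, which is the property the paper exploits; a deployment would threshold this scalar as the low-latency hazard signal.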