Vision and Language: Novel Representations and Artificial Intelligence for Driving Scene Safety Assessment and Autonomous Vehicle Planning

📅 2026-02-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of autonomous driving systems in safety assessment and planning under complex and out-of-distribution scenarios through a system-level approach grounded in joint vision–language representations. The method combines lightweight, category-agnostic hazard screening, task-aligned vision–language embedding extraction, and a planning framework constrained by natural-language instructions to enable semantic risk representation and safe decision-making. Across three use cases — CLIP-based image–text similarity, scene-level multimodal embeddings for a Transformer-based trajectory planner, and language-conditioned planning evaluated on the Waymo Open Dataset and the doScenes dataset — the approach demonstrates low-latency detection of diverse road hazards, significantly reduces severe planning failures, and improves behavioral consistency and safety in ambiguous driving situations.

📝 Abstract
Vision-language models (VLMs) have recently emerged as powerful representation learning systems that align visual observations with natural language concepts, offering new opportunities for semantic reasoning in safety-critical autonomous driving. This paper investigates how vision-language representations support driving scene safety assessment and decision-making when integrated into perception, prediction, and planning pipelines. We study three complementary system-level use cases. First, we introduce a lightweight, category-agnostic hazard screening approach leveraging CLIP-based image-text similarity to produce a low-latency semantic hazard signal. This enables robust detection of diverse and out-of-distribution road hazards without explicit object detection or visual question answering. Second, we examine the integration of scene-level vision-language embeddings into a transformer-based trajectory planning framework using the Waymo Open Dataset. Our results show that naively conditioning planners on global embeddings does not improve trajectory accuracy, highlighting the importance of representation-task alignment and motivating the development of task-informed extraction methods for safety-critical planning. Third, we investigate natural language as an explicit behavioral constraint on motion planning using the doScenes dataset. In this setting, passenger-style instructions grounded in visual scene elements suppress rare but severe planning failures and improve safety-aligned behavior in ambiguous scenarios. Taken together, these findings demonstrate that vision-language representations hold significant promise for autonomous driving safety when used to express semantic risk, intent, and behavioral constraints. Realizing this potential is fundamentally an engineering problem requiring careful system design and structured grounding rather than direct feature injection.
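The abstract's first use case, CLIP-based image–text similarity as a low-latency semantic hazard signal, can be illustrated with a minimal sketch. The paper does not publish its scoring rule; the sketch below assumes precomputed embeddings (in practice produced by a CLIP image encoder and text encoder), compares a scene embedding against hypothetical hazard and benign prompt sets, and converts the similarity gap into a hazard probability. The prompt texts, the max-over-prompts pooling, and the temperature value are all illustrative assumptions, not the authors' method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def hazard_score(image_emb, hazard_embs, benign_embs, temperature=0.07):
    """Two-way softmax over the best-matching hazard vs. benign prompt.

    Returns P(hazard) in (0, 1). The temperature mimics CLIP-style logit
    scaling; max-pooling over each prompt set is an illustrative choice,
    not the paper's published formulation.
    """
    s_hazard = max(cosine(image_emb, e) for e in hazard_embs)
    s_benign = max(cosine(image_emb, e) for e in benign_embs)
    z_hazard = math.exp(s_hazard / temperature)
    z_benign = math.exp(s_benign / temperature)
    return z_hazard / (z_hazard + z_benign)

# Toy usage with hand-made 3-d embeddings standing in for CLIP features.
hazard_prompts = [[1.0, 0.0, 0.0]]   # e.g. "debris blocking the lane"
benign_prompts = [[0.0, 1.0, 0.0]]   # e.g. "clear road ahead"
scene = [0.9, 0.1, 0.0]              # scene closely matching the hazard prompt
print(hazard_score(scene, hazard_prompts, benign_prompts))
```

Because the score depends only on embedding similarity rather than a fixed detector vocabulary, the same scoring rule applies to hazard categories never seen at training time, which is the property the abstract calls category-agnostic.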
Problem

Research questions and friction points this paper is trying to address.

vision-language models
autonomous driving
safety assessment
motion planning
semantic reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language models
semantic hazard detection
task-aligned representation
language-conditioned planning
autonomous driving safety
Ross Greer
University of California Merced
Artificial Intelligence · Machine Vision · Autonomous Driving · Human-Robot Interaction · Computer Music
Maitrayee Keskar
Machine Intelligence, Interaction, and Imagination (Mi³) Laboratory, University of California, Merced, USA; Laboratory for Intelligent and Safe Automobiles (LISA), University of California, San Diego, USA
Angel Martinez-Sanchez
Machine Intelligence, Interaction, and Imagination (Mi³) Laboratory, University of California, Merced, USA
Parthib Roy
Machine Intelligence, Interaction, and Imagination (Mi³) Laboratory, University of California, Merced, USA
Shashank Shriram
Machine Intelligence, Interaction, and Imagination (Mi³) Laboratory, University of California, Merced, USA
Mohan Trivedi
Distinguished Professor of ECE, University of California, San Diego; Director of the CVRR and LISA Labs
Intelligent Vehicles · Autonomous Driving · Machine Vision · Driver Assistance Systems · Human-Robot Interaction