Towards a Multi-Agent Vision-Language System for Zero-Shot Novel Hazardous Object Detection for Autonomous Driving Safety

📅 2025-04-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Zero-shot detection of unknown-category hazardous objects—particularly unpredictable anomalies in video streams—remains a critical challenge in autonomous driving. Method: This paper proposes a vision-language multi-agent collaborative framework: (i) an agent architecture integrating Vision-Language Models (VLMs) and Large Language Models (LLMs) for joint hazard localization and natural-language explanation; (ii) an extension of the COOOL (Challenge-of-Out-of-Label) benchmark with fine-grained textual annotations; and (iii) a video-level zero-shot evaluation paradigm based on cross-modal semantic similarity. The method leverages CLIP-based alignment, zero-shot object detection, and video-semantic matching. Contributions/Results: It achieves improvements in both localization accuracy and semantic identification of unseen hazards. The authors publicly release COOOLER—a toolkit comprising models, code, and the extended COOOL dataset—and provide a systematic analysis of key bottlenecks facing multimodal approaches in real-time driving scenarios.

📝 Abstract
Detecting anomalous hazards in visual data, particularly in video streams, is a critical challenge in autonomous driving. Existing models often struggle with unpredictable, out-of-label hazards due to their reliance on predefined object categories. In this paper, we propose a multimodal approach that integrates vision-language reasoning with zero-shot object detection to improve hazard identification and explanation. Our pipeline combines a Vision-Language Model (VLM) and a Large Language Model (LLM) to detect hazardous objects within a traffic scene. We refine object detection by incorporating OpenAI's CLIP model to match predicted hazards with bounding box annotations, improving localization accuracy. To assess model performance, we create a ground truth dataset by denoising and extending the foundational COOOL (Challenge-of-Out-of-Label) anomaly detection benchmark dataset with complete natural language descriptions for hazard annotations. We define an evaluation protocol for hazard detection and labeling on the extended dataset using cosine similarity, which measures the semantic similarity between the predicted hazard description and the annotated ground truth for each video. Additionally, we release a set of tools for structuring and managing large-scale hazard detection datasets. Our findings highlight the strengths and limitations of current vision-language-based approaches, offering insights into future improvements in autonomous hazard detection systems. Our models, scripts, and data can be found at https://github.com/mi3labucm/COOOLER.git
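The video-level evaluation described above scores a prediction by the cosine similarity between embeddings of the predicted and ground-truth hazard descriptions. A minimal sketch of that scoring step is shown below; the toy bag-of-words `embed` function is a hypothetical stand-in for the CLIP/LLM text encoder the paper actually uses, and the example strings are illustrative, not from the dataset.

```python
import math
from collections import Counter

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def embed(text, vocab):
    """Toy bag-of-words embedding; a real pipeline would use a
    learned text encoder (e.g. CLIP) here instead."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

predicted = "a deer crossing the road ahead"
ground_truth = "deer on the road in front of the vehicle"

# Shared vocabulary so both vectors live in the same space.
vocab = sorted(set(predicted.lower().split()) | set(ground_truth.lower().split()))
score = cosine_similarity(embed(predicted, vocab), embed(ground_truth, vocab))
print(f"semantic similarity: {score:.2f}")
```

With a real text encoder, the same `cosine_similarity` call would operate on dense embedding vectors rather than word counts; only the `embed` step changes.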
Problem

Research questions and friction points this paper is trying to address.

Detect unpredictable hazards in autonomous driving videos
Improve hazard identification using vision-language models
Evaluate detection accuracy with semantic similarity metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates vision-language reasoning with zero-shot detection
Uses CLIP model for accurate hazard localization
Extends COOOL dataset with natural language annotations
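The CLIP-based localization step above amounts to picking, among the candidate bounding boxes, the one whose image embedding is most similar to the predicted hazard description. A minimal sketch, assuming the CLIP image and text embeddings have already been computed (the vectors and box names below are hypothetical placeholders):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical precomputed embeddings: in the described pipeline these
# would come from CLIP's text encoder (hazard description) and image
# encoder (cropped bounding-box regions).
text_embedding = [0.9, 0.1, 0.2]
box_embeddings = {
    "box_0": [0.1, 0.9, 0.3],
    "box_1": [0.8, 0.2, 0.1],  # closest to the described hazard
    "box_2": [0.2, 0.3, 0.9],
}

# Select the box whose crop best matches the predicted description.
best_box = max(box_embeddings, key=lambda b: cosine(text_embedding, box_embeddings[b]))
print(best_box)
```

The argmax over similarity scores is what ties the free-form language output of the VLM/LLM back to a concrete localization in the frame.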
Shashank Shriram
University of California, Merced
Srinivasa Perisetla
University of California, Merced
Aryan Keskar
University of California, Merced
Harsha Krishnaswamy
University of California, Merced
T. E. W. Bossen
Aalborg Universitet
Andreas Møgelmose
Associate professor, Aalborg University
Computer vision · Machine learning · AI · Industrial vision · Driver assistance systems
Ross Greer
University of California Merced
Artificial Intelligence · Machine Vision · Autonomous Driving · Human-Robot Interaction · Computer Music