🤖 AI Summary
Existing vision-language models (VLMs) have not been systematically evaluated for safety understanding in high-risk industrial settings, particularly for detecting hazardous actions and multi-label anomalies.
Method: We introduce iSafetyBench, the first fine-grained video-language benchmark explicitly designed for industrial safety, comprising 1,100 real-world video clips of both standard operations and safety hazards. It supports three zero-shot tasks: open-vocabulary action recognition, multi-label classification, and multiple-choice question answering.
Contribution/Results: Evaluations of eight state-of-the-art VLMs on iSafetyBench reveal severe limitations: current models achieve below 50% accuracy on hazardous-action identification and multi-label safety reasoning. This work fills a critical gap in safety-oriented VLM evaluation, uncovers fundamental challenges in safety-aware perception modeling, and establishes a rigorous foundation for developing trustworthy industrial VLMs.
📝 Abstract
Recent advances in vision-language models (VLMs) have enabled impressive generalization across diverse video understanding tasks under zero-shot settings. However, their capabilities in high-stakes industrial domains, where recognizing both routine operations and safety-critical anomalies is essential, remain largely underexplored. To address this gap, we introduce iSafetyBench, a new video-language benchmark specifically designed to evaluate model performance in industrial environments across both normal and hazardous scenarios. iSafetyBench comprises 1,100 video clips sourced from real-world industrial settings, annotated with open-vocabulary, multi-label action tags spanning 98 routine and 67 hazardous action categories. Each clip is paired with multiple-choice questions for both single-label and multi-label evaluation, enabling fine-grained assessment of VLMs in both standard and safety-critical contexts. We evaluate eight state-of-the-art video-language models under zero-shot conditions. Despite their strong performance on existing video benchmarks, these models struggle with iSafetyBench, particularly in recognizing hazardous activities and in multi-label scenarios. Our results reveal significant performance gaps, underscoring the need for more robust, safety-aware multimodal models for industrial applications. iSafetyBench provides a first-of-its-kind testbed to drive progress in this direction. The dataset is available at: https://github.com/raiyaan-abdullah/iSafety-Bench.
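As a rough illustration of the multi-label evaluation the abstract describes, the sketch below scores a model's predicted tag set against ground-truth tags per clip. The label names and metrics here are illustrative assumptions, not iSafetyBench's official protocol:

```python
# Minimal sketch of multi-label evaluation for video action tags.
# Label names and metrics are hypothetical; iSafetyBench's actual
# annotations and scoring may differ.

def sample_f1(pred, gold):
    """Set-based F1 between predicted and ground-truth tag sets."""
    pred, gold = set(pred), set(gold)
    if not pred and not gold:
        return 1.0  # vacuously perfect when both sets are empty
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def evaluate(predictions, ground_truth):
    """Exact-match accuracy and mean per-clip F1 over all clips."""
    pairs = list(zip(predictions, ground_truth))
    exact = sum(set(p) == set(g) for p, g in pairs)
    mean_f1 = sum(sample_f1(p, g) for p, g in pairs)
    n = len(pairs)
    return exact / n, mean_f1 / n

# Hypothetical predictions for two clips.
preds = [["welding", "no_helmet"], ["forklift_driving"]]
golds = [["welding", "no_helmet"], ["forklift_driving", "speeding"]]
acc, f1 = evaluate(preds, golds)  # acc = 0.5 (second clip misses a tag)
```

Exact-match is strict (a single missed hazard tag counts the whole clip wrong), while per-clip F1 gives partial credit, which is why benchmarks often report both.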