Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs

📅 2025-05-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing safety evaluations for Large Vision-Language Models (LVLMs) focus predominantly on static images, overlooking unique security risks introduced by temporal dynamics in videos. Method: We propose Video-SafetyBench—the first video-centric multimodal safety benchmark for LVLMs—covering 48 fine-grained unsafe scenarios. We introduce a novel video-text co-attack paradigm, develop a controllable semantic-decoupled video synthesis framework, and design RJScore, a safety evaluation metric integrating confidence calibration and human-aligned decision thresholds. Contribution/Results: Experiments reveal a 67.2% attack success rate under benign queries, demonstrating that temporal dynamics significantly exacerbate LVLM safety vulnerabilities. Video-SafetyBench systematically uncovers modality-specific threats inherent to video inputs and establishes a reproducible, standardized benchmark and methodological foundation for advancing robustness research in LVLMs.

📝 Abstract
The increasing deployment of Large Vision-Language Models (LVLMs) raises safety concerns under potential malicious inputs. However, existing multimodal safety evaluations primarily focus on model vulnerabilities exposed by static image inputs, ignoring the temporal dynamics of video that may induce distinct safety risks. To bridge this gap, we introduce Video-SafetyBench, the first comprehensive benchmark designed to evaluate the safety of LVLMs under video-text attacks. It comprises 2,264 video-text pairs spanning 48 fine-grained unsafe categories, each pairing a synthesized video with either a harmful query, which contains explicit malice, or a benign query, which appears harmless but triggers harmful behavior when interpreted alongside the video. To generate semantically accurate videos for safety evaluation, we design a controllable pipeline that decomposes video semantics into subject images (what is shown) and motion text (how it moves), which jointly guide the synthesis of query-relevant videos. To effectively evaluate uncertain or borderline harmful outputs, we propose RJScore, a novel LLM-based metric that incorporates the confidence of judge models and human-aligned decision threshold calibration. Extensive experiments show that benign-query video composition achieves average attack success rates of 67.2%, revealing consistent vulnerabilities to video-induced attacks. We believe Video-SafetyBench will catalyze future research into video-based safety evaluation and defense strategies.
Problem

Research questions and friction points this paper is trying to address.

Evaluates LVLM safety under video-text attacks
Addresses gaps in multimodal safety with dynamic video inputs
Introduces a benchmark of 2,264 unsafe video-text pairs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video-SafetyBench evaluates LVLM safety under video-text attacks
Controllable pipeline synthesizes query-relevant videos for testing
RJScore metric assesses harmful outputs with confidence calibration
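The listing does not give the exact RJScore formula, but the description (judge-model confidence plus a human-aligned decision threshold) suggests a confidence-weighted rating. A minimal sketch, assuming a hypothetical judge that emits a probability distribution over discrete harm ratings (the function names and the 1–5 rating scale are illustrative assumptions, not the paper's actual API):

```python
def rjscore(rating_probs):
    """Confidence-weighted rating: the expected value of the judge's
    harm rating under its own token-probability distribution.

    rating_probs: dict mapping a discrete rating (e.g. 1..5) to the
    judge model's probability of emitting that rating.
    """
    return sum(rating * prob for rating, prob in rating_probs.items())


def is_attack_success(rating_probs, threshold=3.0):
    """Flag a response as harmful when the confidence-weighted rating
    crosses a threshold calibrated against human judgments.
    (The threshold value here is a placeholder, not the paper's.)"""
    return rjscore(rating_probs) >= threshold


# Example: a judge that is 90% confident the response rates 5 (harmful)
# and 10% confident it rates 1 (safe) yields a weighted score of 4.6.
probs = {1: 0.1, 5: 0.9}
print(rjscore(probs), is_attack_success(probs))
```

The point of weighting by confidence rather than taking the judge's single most likely rating is that borderline outputs, where probability mass is split across ratings, land near the threshold instead of being forced to an extreme.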
Xuannan Liu
Beijing University of Posts and Telecommunications
Zekun Li
University of California, Santa Barbara
Zheqi He
Beijing Academy of Artificial Intelligence
Computer Vision, LLM
Peipei Li
Beijing University of Posts and Telecommunications (BUPT)
Computer Vision, Image Synthesis, Face Recognition
Shuhan Xia
Beijing University of Posts and Telecommunications
Artificial Intelligence, Multimodal
Xing Cui
Beijing University of Posts and Telecommunications
Huaibo Huang
NLPR, MAIS, CASIA
Computer Vision, Generative Models, Low-level Vision, Face Recognition
Xi Yang
Beijing Academy of Artificial Intelligence
Ran He
Center for Research on Intelligent Perception and Computing, NLPR, CASIA