Towards Safer Mobile Agents: Scalable Generation and Evaluation of Diverse Scenarios for VLMs

πŸ“… 2026-01-13
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing vision-language model (VLM) safety evaluation benchmarks struggle to encompass complex, dynamic hazardous scenarios, particularly lacking spatiotemporal modeling of moving, intrusive, and distant objects. To address this gap, this work proposes HazardForge, a pipeline integrating image editing models, layout decision algorithms, and a scene validation module to enable the first scalable generation of anomalous driving scenes. Leveraging this pipeline, the authors construct MovSafeBench, a large-scale multiple-choice question-answering benchmark comprising 7,254 images and corresponding QA pairs across 13 categories of dynamic objects. Experimental results reveal a significant performance drop in VLMs under anomalous conditions, especially in tasks requiring fine-grained motion understanding, thereby highlighting critical limitations in current models’ capacity for safe decision-making.

πŸ“ Abstract
Vision Language Models (VLMs) are increasingly deployed in autonomous vehicles and mobile systems, making it crucial to evaluate their ability to support safe decision-making in complex environments. However, existing benchmarks inadequately cover diverse hazardous situations, especially anomalous scenarios with spatio-temporal dynamics. While image editing models are a promising means of synthesizing such hazards, it remains challenging to generate well-formulated scenarios that include the moving, intrusive, and distant objects frequently observed in the real world. To address this gap, we introduce **HazardForge**, a scalable pipeline that combines image editing models with layout decision algorithms and validation modules to generate these scenarios. Using HazardForge, we construct **MovSafeBench**, a multiple-choice question (MCQ) benchmark comprising 7,254 images and corresponding QA pairs across 13 object categories, covering both normal and anomalous objects. Experiments on MovSafeBench show that VLM performance degrades notably in scenarios containing anomalous objects, with the largest drop in scenarios requiring nuanced motion understanding.
Problem

Research questions and friction points this paper is trying to address.

Vision Language Models · hazardous scenarios · anomalous objects · spatio-temporal dynamics · autonomous vehicles
Innovation

Methods, ideas, or system contributions that make the work stand out.

HazardForge · MovSafeBench · Vision Language Models · anomalous scenario generation · scalable evaluation
Takara Taniguchi
OMRON SINIC X, Nagase Hongo Building 3F, 5-24-5 Hongo, Bunkyo-ku, Tokyo, Japan
Kuniaki Saito
Boston University
Artificial Intelligence · Machine Learning · Computer Vision
Atsushi Hashimoto
OMRON SINIC X, Nagase Hongo Building 3F, 5-24-5 Hongo, Bunkyo-ku, Tokyo, Japan