DeepSVU: Towards In-depth Security-oriented Video Understanding via Unified Physical-world Regularized MoE

📅 2026-02-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes In-depth Security-oriented Video Understanding (DeepSVU), a task that goes beyond conventional threat detection and localization by also attributing and evaluating the causes of threats. To this end, it introduces a Unified Physical-world Enhanced MoE (UPE) Block and a Physical-world Trade-off Regularizer (PTR), which together model coarse-to-fine physical-world information (e.g., human behavior, object interactions and background context) and adaptively trade off these factors. Integrated into a video large language model (Video-LLM) architecture, the proposed framework supports end-to-end training and outperforms several advanced Video-LLMs as well as non-VLM approaches on the UCF-C and CUVA instruction datasets. These results indicate that explicitly modeling physical-world information improves security-oriented video understanding.

📝 Abstract
In the literature, prior research on Security-oriented Video Understanding (SVU) has predominantly focused on detecting and localizing threats (e.g., shootings, robberies) in videos, while largely lacking the capability to generate and evaluate the causes of those threats. Motivated by this gap, this paper introduces a new chat-paradigm SVU task, i.e., In-depth Security-oriented Video Understanding (DeepSVU), which aims not only to identify and locate threats but also to attribute and evaluate the causes of the threatening segments. Furthermore, this paper reveals two key challenges in the proposed task: 1) how to effectively model the coarse-to-fine physical-world information (e.g., human behavior, object interactions and background context) to boost the DeepSVU task; and 2) how to adaptively trade off these factors. To tackle these challenges, this paper proposes a new Unified Physical-world Regularized MoE (UPRM) approach. Specifically, UPRM incorporates two key components, the Unified Physical-world Enhanced MoE (UPE) Block and the Physical-world Trade-off Regularizer (PTR), to address the above two challenges, respectively. Extensive experiments conducted on our DeepSVU instruction datasets (i.e., UCF-C instructions and CUVA instructions) demonstrate that UPRM outperforms several advanced Video-LLMs as well as non-VLM approaches. These results justify the importance of the coarse-to-fine physical-world information in the DeepSVU task and demonstrate the effectiveness of our UPRM in capturing such information.
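The abstract does not spell out how the UPE Block or PTR are computed, so the following is only an illustrative sketch of the general idea: a soft mixture over per-factor features plus a penalty on how the mixture weights are distributed. Everything here is an assumption made for illustration: the three experts (human behavior, object interactions, background context), the names `moe_fuse` and `balance_penalty`, and the use of a KL-to-uniform penalty as a stand-in for a trade-off regularizer. It is not the paper's formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_fuse(expert_feats, gate_logits):
    """Fuse per-expert features with soft gating weights.

    expert_feats: (num_experts, dim) array, one row per hypothetical expert
                  (e.g. human behavior, object interactions, background).
    gate_logits:  (num_experts,) unnormalized gate scores.
    """
    weights = softmax(gate_logits)   # weights sum to 1
    fused = weights @ expert_feats   # weighted sum, shape (dim,)
    return fused, weights

def balance_penalty(weights):
    """Hypothetical stand-in for a trade-off regularizer.

    KL divergence from the gate weights to the uniform distribution:
    zero when all experts are weighted equally, larger as one dominates.
    """
    n = weights.shape[-1]
    return float(np.sum(weights * np.log(weights * n + 1e-12)))

rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 8))      # 3 assumed experts, 8-dim features
fused, w = moe_fuse(feats, np.array([2.0, 0.5, -1.0]))
penalty = balance_penalty(w)
```

In a training loop, a penalty of this kind would typically be added to the task loss with a small coefficient, so that the gate does not collapse onto a single source of physical-world information.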
Problem

Research questions and friction points this paper is trying to address.

Security-oriented Video Understanding
threat cause attribution
physical-world information
video understanding
DeepSVU
Innovation

Methods, ideas, or system contributions that make the work stand out.

DeepSVU
Physical-world Regularization
Mixture of Experts (MoE)
Security-oriented Video Understanding
Cause Attribution