PrismVAU: Prompt-Refined Inference System for Multimodal Video Anomaly Understanding

📅 2026-01-06

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes PrismVAU, a system for video anomaly understanding that avoids the expensive annotations, complex training procedures, and external modules that existing approaches depend on. PrismVAU introduces automatic prompt engineering (APE) to this domain for the first time, using a single off-the-shelf multimodal large language model (MLLM) in a two-stage pipeline: coarse anomaly scoring guided by textual anchors, followed by prompt-based contextual refinement. The approach requires no model fine-tuning or frame-level annotations; instead, the textual anchors and prompts are optimized under weak supervision, which reduces computational overhead. Evaluated on standard benchmarks, PrismVAU achieves competitive detection performance while generating interpretable natural-language descriptions of detected anomalies, and it is designed for real-time use.
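To make the coarse scoring stage concrete, the sketch below scores one frame by comparing its embedding with two sets of textual-anchor embeddings. It is an illustration under stated assumptions, not the authors' implementation: the summary does not say how similarity is computed, so cosine similarity, the two-way softmax, the `temperature` value, and all function names are hypothetical choices. The embeddings are assumed to come from some shared image-text encoder that is not shown here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def anomaly_score(frame_emb, normal_anchors, anomalous_anchors, temperature=0.1):
    """Return a score in (0, 1); higher means the frame is closer to an anomalous anchor.

    Assumption: anchors are already embedded in the same space as the frame.
    """
    best_normal = max(cosine(frame_emb, a) for a in normal_anchors)
    best_anomalous = max(cosine(frame_emb, a) for a in anomalous_anchors)
    # A two-way softmax over the two best similarities reduces to a sigmoid
    # of their temperature-scaled difference.
    return 1.0 / (1.0 + math.exp(-(best_anomalous - best_normal) / temperature))


# Toy 2-D embeddings, for illustration only.
normal = [[1.0, 0.0]]
anomalous = [[0.0, 1.0]]
scores = [anomaly_score(f, normal, anomalous) for f in ([1.0, 0.0], [0.0, 1.0])]
```

In this toy setup the first frame aligns with the normal anchor and scores below 0.5, while the second aligns with the anomalous anchor and scores above 0.5. In a two-stage design such as the one described, only frames with high coarse scores would need to be passed to the more expensive refinement step.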

📝 Abstract

Video Anomaly Understanding (VAU) extends traditional Video Anomaly Detection (VAD) by not only localizing anomalies but also describing and reasoning about their context. Existing VAU approaches often rely on fine-tuned multimodal large language models (MLLMs) or external modules such as video captioners, which introduce costly annotations, complex training pipelines, and high inference overhead. In this work, we introduce PrismVAU, a lightweight yet effective system for real-time VAU that leverages a single off-the-shelf MLLM for anomaly scoring, explanation, and prompt optimization. PrismVAU operates in two complementary stages: (1) a coarse anomaly scoring module that computes frame-level anomaly scores via similarity to textual anchors, and (2) an MLLM-based refinement module that contextualizes anomalies through system and user prompts. Both textual anchors and prompts are optimized with a weakly supervised Automatic Prompt Engineering (APE) framework. Extensive experiments on standard VAD benchmarks demonstrate that PrismVAU delivers competitive detection performance and interpretable anomaly explanations without relying on instruction tuning, frame-level annotations, external modules, or dense processing, making it an efficient and practical solution for real-world applications.
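The abstract does not spell out the APE procedure, so the following is only a sketch of what a weakly supervised selection step could look like. It assumes that supervision consists of video-level labels only (1 = contains an anomaly, 0 = normal), that a video's score is the maximum of its frame scores, and that candidates are ranked by video-level AUC. The names `video_auc`, `select_best_candidate`, and `score_frames` are hypothetical, and how candidate anchors or prompts are generated is left out.

```python
def video_auc(video_scores, labels):
    """AUC from pairwise comparisons of positive vs. negative videos (ties count 0.5)."""
    pos = [s for s, y in zip(video_scores, labels) if y == 1]
    neg = [s for s, y in zip(video_scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomalous and one normal video")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def select_best_candidate(candidates, score_frames, videos, labels):
    """Pick the candidate (e.g. an anchor set or prompt) with the highest video-level AUC.

    score_frames(candidate, video) must return a non-empty list of frame scores.
    No frame-level labels are used anywhere.
    """
    best, best_auc = None, -1.0
    for cand in candidates:
        # Weak supervision: aggregate frame scores to one score per video.
        scores = [max(score_frames(cand, v)) for v in videos]
        auc = video_auc(scores, labels)
        if auc > best_auc:
            best, best_auc = cand, auc
    return best, best_auc


# Toy example: each "video" is a list of numbers, and the scorer is a stand-in.
def toy_scorer(cand, video):
    return video if cand == "a" else [1.0 - x for x in video]


videos = [[0.1, 0.9], [0.1, 0.2]]
labels = [1, 0]
best, auc = select_best_candidate(["a", "b"], toy_scorer, videos, labels)
```

Here candidate `"a"` separates the anomalous video from the normal one, while candidate `"b"` produces a tie, so `"a"` is selected. Max-pooling over frames is one common way to relate frame scores to video-level labels; other aggregations would fit the same loop.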

Problem

Research questions and friction points this paper is trying to address.

Video Anomaly Understanding

Multimodal Large Language Models

Annotation Cost

Inference Overhead

External Modules

Innovation

Methods, ideas, or system contributions that make the work stand out.

Video Anomaly Understanding

Multimodal Large Language Model

Prompt Engineering