Pistachio: Towards Synthetic, Balanced, and Long-Form Video Anomaly Benchmarks

📅 2025-11-22
🤖 AI Summary
Existing VAD/VAU benchmarks suffer from limited scene diversity, imbalanced anomaly class distributions, and insufficient temporal complexity; moreover, VAU demands deep semantic understanding and causal reasoning, making manual annotation costly and unscalable. To address these limitations, we introduce Pistachio, the first fully generative video anomaly benchmark, built on state-of-the-art video diffusion models. It employs scene-conditioned anomaly injection, multi-stage narrative planning, and long-sequence temporal-consistency optimization to synthesize coherent, controllable 41-second abnormal videos. Pistachio enables diverse, balanced, and long-horizon (multi-event) anomaly modeling, effectively mitigating real-data biases and annotation bottlenecks. Experiments demonstrate that Pistachio surpasses existing benchmarks in scale, diversity, and temporal complexity, exposing critical weaknesses of mainstream methods in dynamic, long-duration anomaly understanding. It establishes a new standard and identifies key challenges for future VAU research.

📝 Abstract
Automatically detecting abnormal events in videos is crucial for modern autonomous systems, yet existing Video Anomaly Detection (VAD) benchmarks lack the scene diversity, balanced anomaly coverage, and temporal complexity needed to reliably assess real-world performance. Meanwhile, the community is increasingly moving toward Video Anomaly Understanding (VAU), which requires deeper semantic and causal reasoning but remains difficult to benchmark due to the heavy manual annotation effort it demands. In this paper, we introduce Pistachio, a new VAD/VAU benchmark constructed entirely through a controlled, generation-based pipeline. By leveraging recent advances in video generation models, Pistachio provides precise control over scenes, anomaly types, and temporal narratives, effectively eliminating the biases and limitations of Internet-collected datasets. Our pipeline integrates scene-conditioned anomaly assignment, multi-step storyline generation, and a temporally consistent long-form synthesis strategy that produces coherent 41-second videos with minimal human intervention. Extensive experiments demonstrate the scale, diversity, and complexity of Pistachio, revealing new challenges for existing methods and motivating future research on dynamic and multi-event anomaly understanding.
Problem

Research questions and friction points this paper is trying to address.

Existing VAD benchmarks lack scene diversity and balanced anomaly coverage
Video Anomaly Understanding (VAU) requires deeper semantic and causal reasoning but lacks suitable benchmarks
Manual annotation for VAU benchmarks demands heavy human effort
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generation-based pipeline for video anomaly benchmarks
Scene-conditioned anomaly assignment and storyline generation
Temporally consistent long-form synthesis strategy
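The three innovation bullets above describe a staged pipeline. The sketch below is a purely illustrative mock of that flow, not the authors' implementation: all scene/anomaly tables, function names, and the beat structure are hypothetical placeholders, and the actual system would call a video diffusion model where stage 3 merely records a conditioning link.

```python
# Hypothetical sketch of a generation-based anomaly-benchmark pipeline.
# Stage names mirror the paper's description; all data and logic here are
# illustrative placeholders, not the authors' actual implementation.
import random

# Stage 1: scene-conditioned anomaly assignment -- each scene type only
# admits anomalies that are plausible in that context (hypothetical table).
SCENE_ANOMALIES = {
    "street": ["traffic_accident", "robbery", "fighting"],
    "mall": ["shoplifting", "fighting", "vandalism"],
    "campus": ["fighting", "vandalism"],
}

def assign_anomaly(scene: str, rng: random.Random) -> str:
    """Pick an anomaly type consistent with the given scene."""
    return rng.choice(SCENE_ANOMALIES[scene])

# Stage 2: multi-step storyline generation -- split the 41-second target
# duration into ordered narrative beats (normal -> onset -> anomaly -> aftermath).
def plan_storyline(scene: str, anomaly: str, total_s: float = 41.0):
    beats = ["normal activity", f"onset of {anomaly}", anomaly, "aftermath"]
    seg = total_s / len(beats)
    return [{"scene": scene, "beat": b,
             "start": round(i * seg, 2), "end": round((i + 1) * seg, 2)}
            for i, b in enumerate(beats)]

# Stage 3: temporally consistent long-form synthesis -- in the real system
# each beat would condition a video generator on the closing frames of the
# previous clip; here we only record that conditioning link.
def synthesize(storyline):
    clips = []
    for i, beat in enumerate(storyline):
        clips.append({"prompt": f"{beat['scene']}: {beat['beat']}",
                      "conditioned_on_clip": i - 1 if i > 0 else None})
    return clips

if __name__ == "__main__":
    rng = random.Random(0)
    scene = "street"
    anomaly = assign_anomaly(scene, rng)
    story = plan_storyline(scene, anomaly)
    clips = synthesize(story)
    print(anomaly, story[-1]["end"], clips[-1]["conditioned_on_clip"])
```

The chaining in stage 3 is what would give the "temporally consistent long-form" property: each segment is generated conditioned on its predecessor rather than independently, so the stitched 41-second video stays coherent across beats.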
Jie Li (University of Science & Technology Beijing)
Hongyi Cai (University of Malaya) · Data-centric AI, AI for Efficiency, Computer Vision
Mingkang Dong (Monash University)
Muxin Pu (Monash University) · Software Testing, Computer Vision
Shan You (SenseTime Research) · deep learning, multimodal LLM, edge AI
Fei Wang (Shanghai Jiaotong University)
Tao Huang (Shanghai Jiaotong University)