Towards Scalable Video Anomaly Retrieval: A Synthetic Video-Text Benchmark

📅 2025-06-02
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing video anomaly retrieval datasets suffer from scarce real-world samples, privacy constraints, and inadequate coverage of long-tailed anomaly categories. To address these limitations, we introduce SVTA, the first large-scale synthetic video-text anomaly retrieval benchmark. SVTA uses large language models to generate fine-grained textual descriptions for 68 long-tailed anomaly classes, which in turn drive a video diffusion model to synthesize 41,315 high-fidelity, temporally annotated videos (1.36M frames) spanning 30 normal and 68 abnormal event categories. SVTA establishes the first fully synthetic cross-modal anomaly retrieval paradigm: it preserves privacy while maintaining scene realism and long-tailed coverage, and it enables natural-language-driven, fine-grained anomaly localization. Experiments with three state-of-the-art retrieval models show significant performance degradation on SVTA, confirming its value as a challenging robustness benchmark.
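
The summary describes a two-stage pipeline: an LLM expands each event category into fine-grained descriptions, and a text-to-video diffusion model renders each description into a clip. The sketch below illustrates that flow under stated assumptions; the `generate_descriptions` and `synthesize_video` helpers are hypothetical stand-ins, not the paper's actual code or models.

```python
# Minimal sketch of an SVTA-style generation loop (hypothetical helpers;
# the paper does not publish this code). Assumes an LLM that turns a
# category name into scene-level descriptions, and a text-to-video
# diffusion model that renders each description into a short clip.
from dataclasses import dataclass

ANOMALY_CATEGORIES = ["throwing", "stealing", "shooting"]  # 68 classes in the paper
NORMAL_CATEGORIES = ["standing", "walking", "sports"]      # 30 classes in the paper

@dataclass
class Sample:
    category: str
    caption: str
    video_path: str
    is_anomaly: bool

def generate_descriptions(category: str, n: int) -> list[str]:
    """Hypothetical LLM call: prompt an off-the-shelf LLM for n
    fine-grained, scene-level descriptions of the given event category."""
    raise NotImplementedError  # e.g., a chat-completion API call

def synthesize_video(caption: str, out_path: str) -> None:
    """Hypothetical diffusion call: condition a text-to-video model
    on the caption and write the rendered frames to out_path."""
    raise NotImplementedError  # e.g., a text-to-video diffusion pipeline

def build_dataset(n_per_category: int = 5) -> list[Sample]:
    samples = []
    for is_anomaly, categories in [(True, ANOMALY_CATEGORIES),
                                   (False, NORMAL_CATEGORIES)]:
        for cat in categories:
            for i, caption in enumerate(generate_descriptions(cat, n_per_category)):
                path = f"videos/{cat}_{i:04d}.mp4"
                synthesize_video(caption, path)
                samples.append(Sample(cat, caption, path, is_anomaly))
    return samples
```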

๐Ÿ“ Abstract
Video anomaly retrieval aims to localize anomalous events in videos using natural language queries to facilitate public safety. However, existing datasets suffer from severe limitations: (1) data scarcity due to the long-tail nature of real-world anomalies, and (2) privacy constraints that impede large-scale collection. To address both issues at once, we introduce SVTA (Synthetic Video-Text Anomaly benchmark), the first large-scale dataset for cross-modal anomaly retrieval, leveraging generative models to overcome data availability challenges. Specifically, we collect and generate video descriptions via an off-the-shelf large language model (LLM) covering 68 anomaly categories, e.g., throwing, stealing, and shooting. These descriptions encompass common long-tail events. We use these texts to guide a video generative model to produce diverse, high-quality videos. Finally, SVTA comprises 41,315 videos (1.36M frames) with paired captions, covering 30 normal activities, e.g., standing, walking, and sports, and 68 anomalous events, e.g., falling, fighting, theft, explosions, and natural disasters. We adopt three widely used video-text retrieval baselines to comprehensively test SVTA, revealing its challenging nature and its effectiveness in evaluating robust cross-modal retrieval methods. SVTA eliminates the privacy risks associated with real-world anomaly collection while maintaining realistic scenarios. The dataset demo is available at: https://svta-mm.github.io/SVTA.github.io/
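
The abstract does not specify the scoring protocol, but text-to-video retrieval benchmarks of this kind are conventionally evaluated with Recall@K over a query-video similarity matrix. The sketch below shows that standard computation under this assumption; it is not the paper's published evaluation code.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, ks=(1, 5, 10)) -> dict[int, float]:
    """Standard text-to-video Recall@K (assumed metric, not confirmed by
    the paper). sim[i, j] is the similarity between text query i and
    video j; query i's ground-truth match is assumed to be video i.

    Returns the fraction of queries whose correct video ranks in the top K.
    """
    n = sim.shape[0]
    # Rank of the ground-truth video for each query: count how many
    # videos score strictly higher than the diagonal (correct) entry.
    gt_scores = sim[np.arange(n), np.arange(n)][:, None]
    ranks = (sim > gt_scores).sum(axis=1)  # 0 means ranked first
    return {k: float((ranks < k).mean()) for k in ks}

# Toy usage: 4 caption embeddings vs. 4 video embeddings (cosine similarity).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))
video_emb = text_emb + 0.1 * rng.normal(size=(4, 8))
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)
video_emb /= np.linalg.norm(video_emb, axis=1, keepdims=True)
print(recall_at_k(text_emb @ video_emb.T))
```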
Problem

Research questions and friction points this paper is trying to address.

Addressing data scarcity in video anomaly retrieval using synthetic datasets
Overcoming privacy constraints with generative models for anomaly videos
Evaluating cross-modal retrieval methods for diverse anomaly categories
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages generative models for synthetic data
Uses LLM for diverse anomaly descriptions
Combines video-text pairs for retrieval (a possible pair schema is sketched below)
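
Each retrieval sample pairs a caption with its video and, per the summary, a temporal annotation for localization. A minimal illustration of what one record might look like follows; the field names and format are hypothetical, since the paper's annotation schema is not given here.

```python
# Hypothetical SVTA-style record (field names assumed, not from the paper).
sample = {
    "video": "videos/stealing_0042.mp4",
    "caption": "A person slips a phone from a bystander's bag on a crowded platform.",
    "category": "stealing",       # one of 68 anomaly / 30 normal classes
    "is_anomaly": True,
    "temporal_span": [2.4, 7.1],  # anomalous segment in seconds (assumed format)
}
```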
🔎 Similar Papers
No similar papers found.