Large Language Models for Video Surveillance Applications

📅 2025-01-06
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the challenges of massive surveillance video data, high storage costs, and inefficient semantic retrieval, this paper proposes an end-to-end generative semantic summarization method based on vision-language models (VLMs). We pioneer the application of VLMs to spatiotemporally consistent, event-level textual summarization of surveillance videos, integrating temporal segmentation encoding, cross-modal alignment, and prompt-driven summarization. This approach overcomes key limitations of conventional action recognition and generic video summarization techniques. Evaluated on a real-world CCTV dataset, our method achieves 80% accuracy in event temporal localization and 70% spatial consistency. The generated summaries achieve >99.9% compression ratio relative to raw video, drastically reducing long-term storage overhead. Moreover, the system enables sub-second event retrieval and supports natural language–based interactive querying, enhancing operational efficiency and semantic accessibility in large-scale video surveillance systems.
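The summary above describes a three-stage pipeline: temporal segmentation, cross-modal alignment via a VLM, and prompt-driven summarization. A minimal sketch of how such a pipeline could be wired together is below; the function names, the fixed-window segmentation, and the `vlm_describe` call are illustrative assumptions, not the authors' implementation:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start_s: float   # segment start time (seconds)
    end_s: float     # segment end time (seconds)
    summary: str     # event-level textual summary for this window

def segment_video(duration_s: float, window_s: float = 30.0) -> list[tuple[float, float]]:
    """Split a video timeline into fixed-length windows -- a simple
    stand-in for the paper's temporal segmentation step."""
    bounds, t = [], 0.0
    while t < duration_s:
        bounds.append((t, min(t + window_s, duration_s)))
        t += window_s
    return bounds

def summarize_video(duration_s: float, vlm_describe, query: str) -> list[Segment]:
    """Run a (hypothetical) prompt-driven VLM over each window, producing
    searchable event-level text instead of storing raw video."""
    segments = []
    for start, end in segment_video(duration_s):
        # vlm_describe stands in for a VLM call on frames sampled from
        # [start, end] -- an assumed interface, not a real API.
        text = vlm_describe(start, end, prompt=query)
        segments.append(Segment(start, end, text))
    return segments
```

Because each `Segment` carries its time bounds, event retrieval reduces to text search over the summaries followed by a seek to `start_s` in the archived footage.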

๐Ÿ“ Abstract
The rapid increase in video content production has resulted in enormous data volumes, creating significant challenges for efficient analysis and resource management. To address this, robust video analysis tools are essential. This paper presents an innovative proof of concept using Generative Artificial Intelligence (GenAI) in the form of Vision Language Models to enhance the downstream video analysis process. Our tool generates customized textual summaries based on user-defined queries, providing focused insights within extensive video datasets. Unlike traditional methods that offer generic summaries or limited action recognition, our approach utilizes Vision Language Models to extract relevant information, improving analysis precision and efficiency. The proposed method produces textual summaries from extensive CCTV footage, which can then be stored indefinitely in a fraction of the storage space required for video, allowing users to quickly navigate and verify significant events without exhaustive manual review. Qualitative evaluations show 80% temporal and 70% spatial accuracy and consistency for the pipeline.
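A rough back-of-envelope illustrates why text summaries shrink storage so dramatically; the bitrate and summary-size figures below are illustrative assumptions, not numbers from the paper:

```python
# Assumed figures: one hour of H.264 CCTV footage at a modest 2 Mbit/s,
# versus an event-level text summary of roughly 4 KB for the same hour.
video_bytes = 2_000_000 / 8 * 3600     # ~900 MB of video per hour (assumed bitrate)
summary_bytes = 4 * 1024               # ~4 KB of summary text (assumed)
compression = 1 - summary_bytes / video_bytes
print(f"compression ratio: {compression:.4%}")
```

Under any realistic CCTV bitrate the text-only archive stays well above the 99.9% compression figure cited in the AI summary.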
Problem

Research questions and friction points this paper is trying to address.

Video Surveillance
Data Management
Textual Representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative AI
Visual Language Models
Video Analysis Efficiency
🔎 Similar Papers
No similar papers found.
Leon Fernando
Graduate from University of Moratuwa, Sri Lanka
Computer Vision, Natural Language Processing, Computer Architecture
Zann Koh
Engineering and Product Development, SUTD, Singapore
S. C. Joyce
Architecture and Sustainable Design, SUTD, Singapore
Belinda Yuen
Lee Kuan Yew Centre for Innovative Cities, SUTD, Singapore
Chau Yuen
IEEE Fellow, Highly Cited Researcher, Nanyang Technological University
Wireless, Smart Grid, Localization, IoT, Big Data