LAVA: Language Driven Scalable and Versatile Traffic Video Analytics

📅 2025-07-26
🤖 AI Summary
Urban camera networks generate petabyte-scale video data, yet existing SQL-based querying paradigms are constrained by predefined semantic categories and are therefore ill-suited to flexible, open-ended analytical tasks. This paper proposes a language-driven video analytics paradigm and introduces Lava, an end-to-end system that takes natural language queries and retrieves matching traffic targets from video. The system has three core components: (1) an adaptive video segment sampling strategy based on multi-armed bandits; (2) a video-specific open-world object detection module that is not limited to predefined categories; and (3) a long-term trajectory extraction scheme that associates objects over time. Evaluated on a newly built traffic video benchmark, Lava improves F1 score on selection queries by 14%, reduces MPAE on aggregation queries by 0.39, attains 86% top-k precision, and processes videos 9.6× faster than the most accurate baseline.

📝 Abstract
In modern urban environments, camera networks generate massive amounts of operational footage (reaching petabytes each day), making scalable video analytics essential for efficient processing. Many existing approaches adopt an SQL-based paradigm for querying such large-scale video databases; however, this constrains queries to rigid patterns with predefined semantic categories, significantly limiting analytical flexibility. In this work, we explore a language-driven video analytics paradigm aimed at enabling flexible and efficient querying of high-volume video data driven by natural language. In particular, we build \textsc{Lava}, a system that accepts natural language queries and retrieves traffic targets across multiple levels of granularity and arbitrary categories. \textsc{Lava} comprises three main components: 1) a multi-armed bandit-based efficient sampling method for video segment-level localization; 2) a video-specific open-world detection module for object-level retrieval; and 3) a long-term object trajectory extraction scheme for temporal object association, yielding complete trajectories for objects of interest. To support comprehensive evaluation, we further develop a novel benchmark by providing diverse, semantically rich natural language predicates and fine-grained annotations for multiple videos. Experiments on this benchmark demonstrate that \textsc{Lava} improves $F_1$-scores for selection queries by $\mathbf{14\%}$, reduces MPAE for aggregation queries by $\mathbf{0.39}$, and achieves top-$k$ precision of $\mathbf{86\%}$, while processing videos $\mathbf{9.6\times}$ faster than the most accurate baseline.
Problem

Research questions and friction points this paper is trying to address.

Enabling flexible natural language queries for traffic video analytics
Overcoming rigid SQL-based query constraints in video databases
Improving efficiency and accuracy in large-scale video data processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Natural language-driven video query system
Multi-armed bandit-based sampling localization
Open-world detection for object retrieval
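
To make the idea of bandit-based segment sampling concrete, here is a minimal sketch. It is not the paper's algorithm, which this page does not describe. It assumes each video segment is treated as one arm, uses the standard UCB1 rule as a stand-in, and relies on a hypothetical `matches_query(segment, frame)` callback standing in for an expensive detector call. All names and parameters are illustrative.

```python
import math
import random


def ucb1_segment_sampling(num_segments, frames_per_segment, matches_query,
                          budget, seed=0):
    """Spend `budget` frame inspections across segments using UCB1.

    `matches_query(segment, frame) -> bool` is a hypothetical stand-in for
    running a detector on one frame and checking it against the query.
    Returns (pulls, hits): inspections and positive frames per segment.
    """
    rng = random.Random(seed)
    pulls = [0] * num_segments
    hits = [0] * num_segments
    for t in range(1, budget + 1):
        if t <= num_segments:
            seg = t - 1  # inspect every segment once before comparing them
        else:
            # Pick the segment with the highest mean hit rate plus
            # exploration bonus (standard UCB1 index).
            seg = max(
                range(num_segments),
                key=lambda s: hits[s] / pulls[s]
                + math.sqrt(2.0 * math.log(t) / pulls[s]),
            )
        frame = rng.randrange(frames_per_segment)  # uniform frame in segment
        pulls[seg] += 1
        hits[seg] += int(matches_query(seg, frame))
    return pulls, hits


if __name__ == "__main__":
    # Toy setup (assumed): segment 2 contains the target far more often.
    toy_rng = random.Random(1)
    hit_rate = [0.05, 0.05, 0.60, 0.05]
    pulls, hits = ucb1_segment_sampling(
        num_segments=4,
        frames_per_segment=1000,
        matches_query=lambda seg, frame: toy_rng.random() < hit_rate[seg],
        budget=200,
    )
    print("inspections per segment:", pulls)
    print("positive frames per segment:", hits)
```

In a sketch like this, most of the inspection budget drifts toward segments where the query is frequently satisfied, so the expensive detector runs on far fewer irrelevant frames than uniform sampling would require.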
Yanrui Yu
Beijing Institute of Technology, Beijing, China
Tianfei Zhou
Beijing Institute of Technology | ETH Zurich
Artificial Intelligence, Medical AI, Computer Vision
Jiaxin Sun
Beijing Institute of Technology, Beijing, China
Lianpeng Qiao
Beijing Institute of Technology, Beijing, China
Lizhong Ding
Beijing Institute of Technology, Beijing, China
Ye Yuan
Beijing Institute of Technology, Beijing, China
Guoren Wang
Beijing Institute of Technology