LazyVLM: Neuro-Symbolic Approach to Video Analytics

📅 2025-05-27
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing video analysis methods struggle to balance flexibility and efficiency: end-to-end vision-language models (VLMs) incur prohibitive computational costs and scale poorly to long-video understanding, while neuro-symbolic approaches rely heavily on manual annotations and rigid, hand-crafted rules. This paper proposes a "lazy" neuro-symbolic collaborative architecture for open-domain video analysis, the first to enable automatic, fine-grained decomposition and composable execution of complex multi-frame queries. The architecture integrates VLM-based semantic parsing, a relational query engine, and vector retrieval: neural components interpret semi-structured natural language queries, while symbolic modules orchestrate efficient, modular execution. Experiments demonstrate that the method retains VLM-level interactivity and usability while reducing inference cost by an order of magnitude, enabling real-time analysis of hour-long videos. It significantly improves throughput and scalability without sacrificing expressiveness or accuracy.

📝 Abstract
Current video analytics approaches face a fundamental trade-off between flexibility and efficiency. End-to-end Vision Language Models (VLMs) often struggle with long-context processing and incur high computational costs, while neural-symbolic methods depend heavily on manual labeling and rigid rule design. In this paper, we introduce LazyVLM, a neuro-symbolic video analytics system that provides a user-friendly query interface similar to VLMs, while addressing their scalability limitation. LazyVLM enables users to effortlessly drop in video data and specify complex multi-frame video queries using a semi-structured text interface for video analytics. To address the scalability limitations of VLMs, LazyVLM decomposes multi-frame video queries into fine-grained operations and offloads the bulk of the processing to efficient relational query execution and vector similarity search. We demonstrate that LazyVLM provides a robust, efficient, and user-friendly solution for querying open-domain video data at scale.

Problem

Research questions and friction points this paper is trying to address.

Balancing flexibility and efficiency in video analytics
Overcoming scalability limitations of Vision Language Models
Reducing manual labeling in neural-symbolic video analysis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Neuro-symbolic approach for video analytics
Decomposes queries into fine-grained operations
Uses relational query and vector search
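
The decomposition idea above can be illustrated with a small sketch. This is not LazyVLM's implementation: the table layout, function names, toy 3-dimensional embeddings, and similarity threshold are all hypothetical, and Python's built-in `sqlite3` stands in for the relational engine. It shows one multi-frame query ("a person holds a cup, and the same person later sits on a chair") split into vector similarity lookups for the entities and a relational self-join for the temporal constraint.

```python
import math
import sqlite3

# Hypothetical toy embeddings standing in for real visual/text embeddings.
ENTITY_EMBEDDINGS = {
    "e1": [0.9, 0.1, 0.0],  # person-like
    "e2": [0.1, 0.9, 0.0],  # cup-like
    "e3": [0.0, 0.1, 0.9],  # chair-like
    "e4": [0.8, 0.2, 0.1],  # another person-like entity
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_search(query_vec, threshold=0.8):
    """Return sorted ids of entities whose embedding is close to query_vec."""
    return sorted(
        eid for eid, vec in ENTITY_EMBEDDINGS.items()
        if cosine(query_vec, vec) >= threshold
    )


def build_db():
    """Create an in-memory table of per-frame relationships (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE relations "
        "(frame_id INTEGER, subject_id TEXT, predicate TEXT, object_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO relations VALUES (?, ?, ?, ?)",
        [
            (1, "e1", "holds", "e2"),
            (2, "e4", "holds", "e2"),
            (5, "e1", "sits_on", "e3"),
            (0, "e4", "sits_on", "e3"),  # happens before e4 holds the cup
        ],
    )
    return conn


def run_two_step_query(conn, subj_vec, pred1, obj1_vec, pred2, obj2_vec):
    """Find subjects that do (pred1, obj1) and, in a later frame, (pred2, obj2)."""
    # Step 1: resolve each entity description with a vector similarity search.
    subjects = vector_search(subj_vec)
    objs1 = vector_search(obj1_vec)
    objs2 = vector_search(obj2_vec)
    if not (subjects and objs1 and objs2):
        return []

    def marks(items):
        return ",".join("?" for _ in items)

    # Step 2: enforce the relationships and temporal order with a self-join.
    sql = f"""
        SELECT a.subject_id, a.frame_id, b.frame_id
        FROM relations AS a
        JOIN relations AS b ON a.subject_id = b.subject_id
        WHERE a.predicate = ? AND b.predicate = ?
          AND a.subject_id IN ({marks(subjects)})
          AND a.object_id IN ({marks(objs1)})
          AND b.object_id IN ({marks(objs2)})
          AND b.frame_id > a.frame_id
        ORDER BY a.frame_id
    """
    params = [pred1, pred2, *subjects, *objs1, *objs2]
    return conn.execute(sql, params).fetchall()


if __name__ == "__main__":
    conn = build_db()
    person, cup, chair = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
    print(run_two_step_query(conn, person, "holds", cup, "sits_on", chair))
```

In this sketch no expensive model runs at query time: entity matching reduces to similarity comparisons and the multi-frame logic to a join, which is the kind of offloading the abstract describes. A real system would presumably use learned embeddings and an indexed vector store instead of the linear scan shown here.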