๐ค AI Summary
Existing video analysis methods struggle to balance flexibility and efficiency: end-to-end vision-language models (VLMs) incur prohibitive computational costs and poor scalability for long-video understanding, while neural-symbolic approaches rely heavily on manual annotations and rigid, hand-crafted rules. This paper proposes a โlazyโ neural-symbolic collaborative architecture for open-domain video analysisโthe first to enable automatic, fine-grained decomposition and composable execution of multi-frame complex queries. The architecture integrates VLM-based semantic parsing, a relational query engine, and vector retrieval: neural components interpret semi-structured natural language queries, while symbolic modules orchestrate efficient, modular execution. Experiments demonstrate that our method retains VLM-level interactivity and usability while reducing inference cost by an order of magnitude, enabling real-time analysis of hour-long videos. It significantly improves throughput and scalability without sacrificing expressiveness or accuracy.
๐ Abstract
Current video analytics approaches face a fundamental trade-off between flexibility and efficiency. End-to-end Vision Language Models (VLMs) often struggle with long-context processing and incur high computational costs, while neural-symbolic methods depend heavily on manual labeling and rigid rule design. In this paper, we introduce LazyVLM, a neuro-symbolic video analytics system that provides a user-friendly query interface similar to VLMs, while addressing their scalability limitation. LazyVLM enables users to effortlessly drop in video data and specify complex multi-frame video queries using a semi-structured text interface for video analytics. To address the scalability limitations of VLMs, LazyVLM decomposes multi-frame video queries into fine-grained operations and offloads the bulk of the processing to efficient relational query execution and vector similarity search. We demonstrate that LazyVLM provides a robust, efficient, and user-friendly solution for querying open-domain video data at scale.