LazyVLM: Neuro-Symbolic Approach to Video Analytics

📅 2025-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video analysis methods struggle to balance flexibility and efficiency: end-to-end vision-language models (VLMs) incur prohibitive computational costs and scale poorly to long videos, while neuro-symbolic approaches rely heavily on manual annotations and rigid, hand-crafted rules. This paper proposes LazyVLM, a “lazy” neuro-symbolic architecture for open-domain video analysis, presented as the first to enable automatic, fine-grained decomposition and composable execution of complex multi-frame queries. The architecture integrates VLM-based semantic parsing, a relational query engine, and vector retrieval: neural components interpret semi-structured natural language queries, while symbolic modules orchestrate efficient, modular execution. The reported experiments show that the method retains VLM-level interactivity and usability while reducing inference cost by an order of magnitude, enabling real-time analysis of hour-long videos. Throughput and scalability improve significantly without sacrificing expressiveness or accuracy.

📝 Abstract

Current video analytics approaches face a fundamental trade-off between flexibility and efficiency. End-to-end Vision Language Models (VLMs) often struggle with long-context processing and incur high computational costs, while neural-symbolic methods depend heavily on manual labeling and rigid rule design. In this paper, we introduce LazyVLM, a neuro-symbolic video analytics system that provides a user-friendly query interface similar to VLMs, while addressing their scalability limitation. LazyVLM enables users to effortlessly drop in video data and specify complex multi-frame video queries using a semi-structured text interface for video analytics. To address the scalability limitations of VLMs, LazyVLM decomposes multi-frame video queries into fine-grained operations and offloads the bulk of the processing to efficient relational query execution and vector similarity search. We demonstrate that LazyVLM provides a robust, efficient, and user-friendly solution for querying open-domain video data at scale.
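The vector similarity search mentioned in the abstract can be illustrated with a minimal sketch. This is not LazyVLM's implementation: the toy index, the three-dimensional embeddings, and the names `INDEX`, `cosine`, and `search` are all hypothetical, chosen only to show how a textual entity description, once embedded, could be matched against per-frame object embeddings without invoking a VLM on every frame.

```python
import math

# Hypothetical toy index of (frame_id, object_id, embedding) entries.
# A real system would store embeddings produced by a vision encoder.
INDEX = [
    (0, "obj_a", [0.9, 0.1, 0.0]),
    (0, "obj_b", [0.1, 0.9, 0.0]),
    (1, "obj_a", [0.8, 0.2, 0.1]),
    (2, "obj_c", [0.0, 0.1, 0.9]),
]


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec, k=2):
    """Return the k (frame_id, object_id) pairs most similar to query_vec."""
    scored = [(cosine(query_vec, emb), frame, obj) for frame, obj, emb in INDEX]
    scored.sort(reverse=True)  # highest similarity first
    return [(frame, obj) for _, frame, obj in scored[:k]]


# The query vector stands in for an embedded entity description.
top = search([1.0, 0.0, 0.0], k=2)
print(top)
```

A brute-force scan like this is linear in the index size; at scale, an approximate nearest-neighbor index would typically replace it.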

Problem

Research questions and friction points this paper is trying to address.

Balancing flexibility and efficiency in video analytics

Overcoming scalability limitations of Vision Language Models

Reducing manual labeling in neural-symbolic video analysis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Neuro-symbolic approach for video analytics

Decomposes queries into fine-grained operations

Uses relational query and vector search
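As a hedged illustration of how a multi-frame query might reduce to relational execution, the sketch below stores per-frame relationships in an in-memory SQLite table and answers "a subject is near something, then later rides something" with a self-join. The `rel` schema, the predicate names, and the `find_sequences` helper are assumptions made for this example and do not come from the paper.

```python
import sqlite3

# Hypothetical schema: one row per (frame, subject, predicate, object) fact,
# as might be extracted once from the video and then reused across queries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rel (frame INTEGER, subj TEXT, pred TEXT, obj TEXT)")
conn.executemany(
    "INSERT INTO rel VALUES (?, ?, ?, ?)",
    [
        (1, "person_1", "near", "car_1"),
        (2, "person_1", "near", "car_1"),
        (5, "person_1", "rides", "bike_1"),
        (3, "person_2", "rides", "bike_1"),
        (4, "person_2", "near", "car_1"),
    ],
)


def find_sequences(conn, first_pred, second_pred):
    """Find subjects with first_pred in one frame and second_pred in a later one.

    Returns (subject, earliest first frame, earliest second frame) per subject.
    """
    query = """
        SELECT a.subj, MIN(a.frame), MIN(b.frame)
        FROM rel AS a
        JOIN rel AS b ON a.subj = b.subj AND b.frame > a.frame
        WHERE a.pred = ? AND b.pred = ?
        GROUP BY a.subj
        ORDER BY a.subj
    """
    return conn.execute(query, (first_pred, second_pred)).fetchall()


matches = find_sequences(conn, "near", "rides")
print(matches)
```

Here the temporal constraint is just a join condition (`b.frame > a.frame`), so the database engine handles ordering across frames rather than a model reasoning over the whole clip.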
