Self-Enhancing Video Data Management System for Compositional Events with Large Language Models [Technical Report]

📅 2024-08-05

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing video data management systems assume a predefined module for every subtask, which limits their support for open-domain compositional video queries. Method: VOCAL-UDF is a self-enhancing system that answers compositional queries without predefined modules. It uses large language models (LLMs) to interpret user intent, identify missing modules, and generate user-defined functions (UDFs) of two kinds: program-based UDFs (LLM-generated Python functions) and distilled-model UDFs (lightweight vision models distilled from strong pretrained models). To resolve ambiguity in user intent, it generates multiple candidate UDFs and uses active learning to select the best one. Contribution/Results: Evaluated on three video datasets, VOCAL-UDF significantly improves query performance, showing that complex compositional queries can be answered without manually predefining modules.

📝 Abstract

Complex video queries can be answered by decomposing them into modular subtasks. However, existing video data management systems assume the existence of predefined modules for each subtask. We introduce VOCAL-UDF, a novel self-enhancing system that supports compositional queries over videos without the need for predefined modules. VOCAL-UDF automatically identifies and constructs missing modules and encapsulates them as user-defined functions (UDFs), thus expanding its querying capabilities. To achieve this, we formulate a unified UDF model that leverages large language models (LLMs) to aid in new UDF generation. VOCAL-UDF handles a wide range of concepts by supporting both program-based UDFs (i.e., Python functions generated by LLMs) and distilled-model UDFs (lightweight vision models distilled from strong pretrained models). To resolve the inherent ambiguity in user intent, VOCAL-UDF generates multiple candidate UDFs and uses active learning to efficiently select the best one. With the self-enhancing capability, VOCAL-UDF significantly improves query performance across three video datasets.
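The self-enhancing loop described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the `UDFRegistry` class, the bounding-box tuple layout, and the `left_of` and `behind` predicates are hypothetical, and the step where an LLM would generate the missing function is replaced by a hand-written stand-in.

```python
from typing import Callable, Dict, List, Tuple

# Assumed representation: a bounding box as (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]
UDF = Callable[[Box, Box], bool]


def left_of(a: Box, b: Box) -> bool:
    """Hypothetical program-based UDF: center of a lies left of center of b."""
    return (a[0] + a[2]) / 2 < (b[0] + b[2]) / 2


class UDFRegistry:
    """Hypothetical store of the predicates a query engine can evaluate."""

    def __init__(self) -> None:
        self._udfs: Dict[str, UDF] = {}

    def register(self, name: str, fn: UDF) -> None:
        self._udfs[name] = fn

    def missing(self, predicates: List[str]) -> List[str]:
        """Return the predicates a query needs that have no UDF yet."""
        return [p for p in predicates if p not in self._udfs]

    def evaluate(self, predicates: List[str], a: Box, b: Box) -> bool:
        """A conjunctive query holds only if every predicate holds."""
        return all(self._udfs[p](a, b) for p in predicates)


registry = UDFRegistry()
registry.register("left_of", left_of)

query = ["left_of", "behind"]
gaps = registry.missing(query)  # "behind" has no module yet


def behind(a: Box, b: Box) -> bool:
    """Stand-in for a generated UDF; the heuristic is an assumption:
    an object whose bottom edge is higher in the image is farther away."""
    return a[3] < b[3]


# Registering the new UDF expands what the system can answer.
registry.register("behind", behind)
result = registry.evaluate(query, (10, 20, 50, 80), (100, 30, 160, 120))
```

In this sketch the query fails the gap check once, the missing predicate is added, and the same query then evaluates without further changes to the registry.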

Problem

Research questions and friction points this paper is trying to address.

Existing systems assume a predefined module for every subtask

Missing modules must be identified and constructed automatically

User intent behind a new concept is inherently ambiguous

Innovation

Methods, ideas, or system contributions that make the work stand out.

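Self-enhancing system that constructs missing modules as UDFs

Unified UDF model covering program-based and distilled-model UDFs, generated with LLM assistance

Active learning to select the best of multiple candidate UDFs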
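Choosing among candidate UDFs with a small labeling budget can be sketched as follows. The source does not say which active learning strategy VOCAL-UDF uses, so this example swaps in a simple disagreement-based rule: label the example on which the candidates split most evenly, then keep the candidate that agrees with the most labels. The function `select_udf`, the distance thresholds, and the oracle are all hypothetical.

```python
from typing import Callable, List, Sequence


def select_udf(
    candidates: Sequence[Callable[[float], bool]],
    pool: Sequence[float],
    oracle: Callable[[float], bool],
    budget: int,
) -> int:
    """Return the index of the candidate that best matches the oracle labels."""
    unlabeled: List[float] = list(pool)
    scores = [0] * len(candidates)

    def disagreement(x: float) -> int:
        # Size of the minority vote; 0 means every candidate agrees on x.
        votes = sum(c(x) for c in candidates)
        return min(votes, len(candidates) - votes)

    for _ in range(budget):
        if not unlabeled:
            break
        x = max(unlabeled, key=disagreement)
        if disagreement(x) == 0:
            break  # remaining examples cannot separate the candidates
        unlabeled.remove(x)
        label = oracle(x)  # one label request to the user
        for i, c in enumerate(candidates):
            scores[i] += int(c(x) == label)

    return max(range(len(candidates)), key=lambda i: scores[i])


# Three assumed interpretations of "near", as pixel-distance thresholds.
candidates = [lambda d: d < 50, lambda d: d < 100, lambda d: d < 200]
pool = [10, 60, 120, 250, 90, 150]


def user_intent(d: float) -> bool:
    """Simulated user who means "closer than 100 pixels"."""
    return d < 100


best = select_udf(candidates, pool, user_intent, budget=3)
```

Examples on which all candidates agree, such as distances 10 and 250 here, are never sent for labeling, which is the reason for ranking by disagreement rather than sampling at random.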
