Empowering Agentic Video Analytics Systems with Video Language Models

📅 2025-05-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current AI video analytics systems are constrained to closed, predefined tasks, and the short context windows of Video-Language Models (VLMs) hinder understanding and reasoning over open-domain, ultra-long (e.g., hour-scale) videos. To address this, the authors propose AVA, a VLM-powered system built on two key ideas: near real-time construction of Event Knowledge Graphs (EKGs) that efficiently index long or continuous video streams, and an agentic retrieval-generation mechanism grounded in those EKGs that handles complex and diverse queries. By unifying VLMs, retrieval-augmented generation (RAG), and EKGs, AVA addresses the dual bottlenecks, task openness and context length, that limit prior systems. On LVBench and VideoMME-Long, AVA achieves 62.3% and 64.1% accuracy, respectively; on the newly curated long-video benchmark AVA-100 (8 videos, each exceeding 10 hours, with 120 manually annotated question-answer pairs), it attains 75.8%, significantly outperforming state-of-the-art VLM and video RAG systems.

📝 Abstract
AI-driven video analytics has become increasingly pivotal across diverse domains. However, existing systems are often constrained to specific, predefined tasks, limiting their adaptability in open-ended analytical scenarios. The recent emergence of Video-Language Models (VLMs) as transformative technologies offers significant potential for enabling open-ended video understanding, reasoning, and analytics. Nevertheless, their limited context windows present challenges when processing ultra-long video content, which is prevalent in real-world applications. To address this, we introduce AVA, a VLM-powered system designed for open-ended, advanced video analytics. AVA incorporates two key innovations: (1) the near real-time construction of Event Knowledge Graphs (EKGs) for efficient indexing of long or continuous video streams, and (2) an agentic retrieval-generation mechanism that leverages EKGs to handle complex and diverse queries. Comprehensive evaluations on public benchmarks, LVBench and VideoMME-Long, demonstrate that AVA achieves state-of-the-art performance, attaining 62.3% and 64.1% accuracy, respectively, significantly surpassing existing VLM and video Retrieval-Augmented Generation (RAG) systems. Furthermore, to evaluate video analytics in ultra-long and open-world video scenarios, we introduce a new benchmark, AVA-100. This benchmark comprises 8 videos, each exceeding 10 hours in duration, along with 120 manually annotated, diverse, and complex question-answer pairs. On AVA-100, AVA achieves top-tier performance with an accuracy of 75.8%.
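The first innovation, incremental EKG indexing, can be illustrated with a minimal sketch. All names here (`EventNode`, `EventKnowledgeGraph`, the edge relations) are hypothetical stand-ins, not the paper's actual implementation; in AVA, event summaries and entities would come from a VLM rather than being supplied by hand:

```python
from dataclasses import dataclass, field

@dataclass
class EventNode:
    """One node in a hypothetical Event Knowledge Graph (EKG)."""
    start_s: float   # event start time in seconds
    end_s: float     # event end time in seconds
    summary: str     # short caption (produced by a VLM in a real system)
    entities: set = field(default_factory=set)  # people/objects involved

class EventKnowledgeGraph:
    """Incrementally indexes a video stream as temporally linked events."""

    def __init__(self):
        self.nodes: list[EventNode] = []
        self.edges: list[tuple[int, int, str]] = []  # (src, dst, relation)

    def add_event(self, node: EventNode) -> int:
        """Append one event as it arrives; near real-time, no re-indexing."""
        idx = len(self.nodes)
        if self.nodes:
            # temporal edge preserves stream order
            self.edges.append((idx - 1, idx, "next"))
            # entity edges link events that share participants
            for j, prev in enumerate(self.nodes):
                if prev.entities & node.entities:
                    self.edges.append((j, idx, "shares_entity"))
        self.nodes.append(node)
        return idx

    def lookup(self, keyword: str) -> list[int]:
        """Coarse retrieval: indices of events whose summary mentions keyword."""
        kw = keyword.lower()
        return [i for i, n in enumerate(self.nodes) if kw in n.summary.lower()]
```

The point of such an index is that answering a query over a 10-hour video touches only the matching events, never the full frame sequence, which is how the context-window bottleneck is sidestepped.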
Problem

Research questions and friction points this paper addresses.

Enabling open-ended video understanding and analytics with VLMs
Overcoming limited context windows for ultra-long video processing
Handling complex queries in real-world video analytics scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Video-Language Models for open-ended analytics
Implements Event Knowledge Graphs for video indexing
Employs agentic retrieval-generation for complex queries
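The agentic retrieval-generation idea can be sketched as an iterative loop: retrieve candidate events, ask the model, and widen the search if it is not yet confident. This is a simplified illustration under assumed interfaces; `events` stands in for an EKG index, and `vlm_answer` stubs a real VLM call returning `(answer, confident)`:

```python
def agentic_answer(events, question, vlm_answer, max_rounds=3):
    """Iteratively retrieve event summaries until the model is confident.

    events:     list of dicts like {"t": "0-30s", "summary": "..."}
                (a stand-in for an Event Knowledge Graph index)
    vlm_answer: callable (question, context) -> (answer, confident);
                a stub for a real Video-Language Model call
    """
    # crude keyword extraction from the question (illustrative only)
    keywords = {w.strip("?.,!") for w in question.lower().split()
                if len(w.strip("?.,!")) > 2}
    context, seen = [], set()
    answer = "unanswerable"
    for _ in range(max_rounds):
        # retrieve events whose summaries match any current keyword
        for i, ev in enumerate(events):
            if i not in seen and any(kw in ev["summary"].lower()
                                     for kw in keywords):
                seen.add(i)
                context.append(f'[{ev["t"]}] {ev["summary"]}')
        answer, confident = vlm_answer(question, context)
        if confident or len(seen) == len(events):
            return answer
        # not confident: expand the query with terms from retrieved events
        for i in seen:
            keywords |= {w for w in events[i]["summary"].lower().split()
                         if len(w) > 2}
    return answer
```

The loop structure, rather than the keyword matching, is the point: the agent decides per round whether the retrieved evidence suffices, which lets complex queries trigger multiple targeted retrievals instead of one fixed-size context dump.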