🤖 AI Summary
Existing video understanding methods face limitations in hierarchical modeling, fine-grained temporal annotation, and online streaming processing. To address these challenges, this work introduces the task of *hierarchical streaming video understanding*, which jointly performs online temporal action localization and free-form semantic description generation. We leverage large language models to automatically construct action hierarchies from existing annotations and propose OpenHOUSE, a unified system that integrates vision-language models, a lightweight streaming encoder, and a generative decoder to enable real-time action boundary detection and high-level event abstraction. On standard benchmarks, the method achieves a 1.9× improvement in action boundary detection accuracy over direct extensions of existing approaches. To the best of our knowledge, this is the first end-to-end framework for hierarchical, generative, and streaming video understanding, establishing a scalable technical pathway for open-domain video analysis.
📝 Abstract
We introduce Hierarchical Streaming Video Understanding, a task that combines online temporal action localization with free-form description generation. Given the scarcity of datasets with hierarchical and fine-grained temporal annotations, we demonstrate that LLMs can effectively group atomic actions into higher-level events, enriching existing datasets. We then propose OpenHOUSE (Open-ended Hierarchical Online Understanding System for Events), which extends streaming action perception beyond action classification. OpenHOUSE features a specialized streaming module that accurately detects boundaries between closely adjacent actions, nearly doubling the performance of direct extensions of existing methods. We envision that the future of streaming action perception lies in the integration of powerful generative models, with OpenHOUSE representing a key step in that direction.
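The LLM-based grouping of atomic actions into higher-level events can be sketched as follows. This is a minimal illustration only: the prompt wording, the `call_llm` stub, and the JSON reply schema are assumptions for the sketch, not the authors' actual pipeline.

```python
import json

def build_prompt(atomic_actions):
    """Ask the model to cluster atomic actions into named higher-level events."""
    # Hypothetical prompt format; the paper does not specify the exact wording.
    return (
        "Group the following atomic actions into higher-level events. "
        "Reply with JSON mapping each event name to its list of actions.\n"
        + "\n".join(f"- {a}" for a in atomic_actions)
    )

def call_llm(prompt):
    # Stub standing in for a real LLM API call, so the sketch is
    # self-contained and runnable; a real system would query a model here.
    return json.dumps({
        "make coffee": ["grind beans", "boil water", "pour water"],
        "clean up": ["wash cup", "wipe counter"],
    })

def build_hierarchy(atomic_actions):
    """Return {event_name: [atomic_action, ...]} parsed from the LLM reply."""
    reply = call_llm(build_prompt(atomic_actions))
    hierarchy = json.loads(reply)
    # Keep only actions that actually appear in the input annotation set,
    # discarding any labels the model hallucinated.
    return {
        event: [a for a in actions if a in atomic_actions]
        for event, actions in hierarchy.items()
    }

actions = ["grind beans", "boil water", "pour water", "wash cup", "wipe counter"]
print(build_hierarchy(actions))
```

The filtering step matters in practice: constraining the output to the dataset's existing label vocabulary is a simple way to keep LLM-generated hierarchies consistent with the underlying fine-grained annotations.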