Open-ended Hierarchical Streaming Video Understanding with Vision Language Models

📅 2025-09-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video understanding methods are limited in hierarchical modeling, fine-grained temporal annotation, and online streaming processing. To address these limitations, this work introduces the task of *hierarchical streaming video understanding*, which jointly performs online temporal action localization and free-form semantic description generation. We use large language models to automatically construct action hierarchies and propose OpenHOUSE, a unified system that integrates vision-language models, a lightweight streaming encoder, and a generative decoder for real-time action boundary detection and high-level event abstraction. On standard benchmarks, our method achieves a 1.9× improvement in action boundary detection accuracy over baselines built as direct extensions of existing methods. To the best of our knowledge, this is the first end-to-end framework for hierarchical, generative, and streaming video understanding, and it offers a scalable path toward open-domain video analysis.

📝 Abstract
We introduce Hierarchical Streaming Video Understanding, a task that combines online temporal action localization with free-form description generation. Given the scarcity of datasets with hierarchical and fine-grained temporal annotations, we demonstrate that LLMs can effectively group atomic actions into higher-level events, enriching existing datasets. We then propose OpenHOUSE (Open-ended Hierarchical Online Understanding System for Events), which extends streaming action perception beyond action classification. OpenHOUSE features a specialized streaming module that accurately detects boundaries between closely adjacent actions, nearly doubling the performance of direct extensions of existing methods. We envision the future of streaming action perception in the integration of powerful generative models, with OpenHOUSE representing a key step in that direction.

Problem

Research questions and friction points this paper is trying to address.

Online temporal action localization with free-form description generation
Grouping atomic actions into higher-level events using LLMs
Accurately detecting boundaries between closely adjacent actions
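
To make the grouping step concrete, here is a minimal Python sketch of how atomic action annotations might be merged into higher-level events once an LLM has proposed a grouping. It is an illustration under stated assumptions, not the paper's pipeline: the `Segment` type, the prompt wording, the JSON reply format, and the cooking labels are all hypothetical, and the LLM reply is hard-coded rather than obtained from a real model.

```python
import json
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    label: str


def build_prompt(actions):
    """Format atomic actions as an indexed list for an LLM (hypothetical prompt)."""
    lines = [
        f"{i}: {a.label} ({a.start:.1f}s-{a.end:.1f}s)"
        for i, a in enumerate(actions)
    ]
    return (
        "Group the following atomic actions into higher-level events. "
        'Reply with JSON: [{"event": str, "members": [int, ...]}].\n'
        + "\n".join(lines)
    )


def events_from_grouping(actions, llm_reply):
    """Turn a JSON grouping into event segments spanning their member actions."""
    events = []
    for group in json.loads(llm_reply):
        members = [actions[i] for i in group["members"]]
        events.append(
            Segment(
                start=min(m.start for m in members),
                end=max(m.end for m in members),
                label=group["event"],
            )
        )
    return events


# Invented atomic annotations, for illustration only.
atomic = [
    Segment(0.0, 4.0, "crack egg"),
    Segment(4.0, 9.5, "whisk egg"),
    Segment(9.5, 15.0, "pour into pan"),
    Segment(15.0, 21.0, "butter toast"),
]

# Hypothetical LLM reply; a real system would send build_prompt(atomic) to a model.
reply = (
    '[{"event": "prepare omelette", "members": [0, 1, 2]},'
    ' {"event": "prepare toast", "members": [3]}]'
)
events = events_from_grouping(atomic, reply)
```

Each event inherits its temporal extent from its member actions, so the fine-grained annotations remain the single source of timing information.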

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs group atomic actions into events
OpenHOUSE extends streaming action perception
Specialized streaming module detects action boundaries
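
The boundary-detection problem can be illustrated with a small online decoder. The sketch below is not OpenHOUSE's streaming module, whose design is not described here. It assumes a hypothetical upstream model that emits per-frame start and end probabilities, and shows one simple way to separate back-to-back actions: a new start closes any action that is still open. The class name, the threshold, and the `min_len` guard are all assumptions.

```python
from typing import List, Optional, Tuple


class OnlineLocalizer:
    """Toy online decoder over per-frame start/end scores (hypothetical design)."""

    def __init__(self, threshold: float = 0.5, min_len: int = 2):
        self.threshold = threshold
        self.min_len = min_len  # frames; suppresses repeated start detections
        self.open_start: Optional[int] = None
        self.finished: List[Tuple[int, int]] = []

    def step(self, t: int, p_start: float, p_end: float) -> None:
        """Consume one frame; only past and current frames are ever used."""
        if p_start >= self.threshold:
            if self.open_start is not None:
                if t - self.open_start < self.min_len:
                    return  # too close to the last start: treat as the same onset
                # Back-to-back actions: the new start also ends the open action.
                self.finished.append((self.open_start, t))
            self.open_start = t
        elif p_end >= self.threshold and self.open_start is not None:
            self.finished.append((self.open_start, t))
            self.open_start = None


# Invented (p_start, p_end) scores for frames 0..7.
scores = [
    (0.1, 0.0), (0.9, 0.0), (0.2, 0.1), (0.1, 0.2),
    (0.8, 0.7), (0.1, 0.1), (0.0, 0.9), (0.0, 0.9),
]
loc = OnlineLocalizer()
for t, (p_start, p_end) in enumerate(scores):
    loc.step(t, p_start, p_end)
```

Thresholding a single "actionness" score would merge the two actions that meet at frame 4, which is why this sketch tracks start and end signals separately.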