🤖 AI Summary
Existing video understanding methods face limitations in hierarchical modeling, fine-grained temporal annotation, and online streaming processing. To address these challenges, this work introduces the task of *hierarchical streaming video understanding*, which jointly performs online temporal action localization and free-form semantic description generation. We leverage large language models to automatically construct action hierarchies from existing annotations and propose OpenHOUSE, a unified system that integrates vision-language models, a lightweight streaming encoder, and a generative decoder to enable real-time action boundary detection and high-level event abstraction. On standard benchmarks, the method achieves a 1.9× improvement in action boundary detection accuracy over direct extensions of existing approaches. To the best of our knowledge, this is the first end-to-end framework for hierarchical, generative, and streaming video understanding, establishing a scalable technical pathway for open-domain video analysis.
📝 Abstract
We introduce Hierarchical Streaming Video Understanding, a task that combines online temporal action localization with free-form description generation. Given the scarcity of datasets with hierarchical and fine-grained temporal annotations, we demonstrate that LLMs can effectively group atomic actions into higher-level events, enriching existing datasets. We then propose OpenHOUSE (Open-ended Hierarchical Online Understanding System for Events), which extends streaming action perception beyond action classification. OpenHOUSE features a specialized streaming module that accurately detects boundaries between closely adjacent actions, nearly doubling the performance of direct extensions of existing methods. We envision that the future of streaming action perception lies in the integration of powerful generative models, with OpenHOUSE representing a key step in that direction.
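The LLM-based grouping of atomic actions into higher-level events can be sketched as follows. This is a minimal illustration only: the prompt wording, the `call_llm` stub, and the JSON reply schema are assumptions for the sketch, not the authors' actual pipeline.

```python
import json

def build_prompt(atomic_actions):
    """Ask the model to cluster atomic actions into named higher-level events."""
    # Hypothetical prompt format; the paper does not specify the exact wording.
    return (
        "Group the following atomic actions into higher-level events. "
        "Reply with JSON mapping each event name to its list of actions.\n"
        + "\n".join(f"- {a}" for a in atomic_actions)
    )

def call_llm(prompt):
    # Stub standing in for a real LLM API call, so the sketch is
    # self-contained and runnable; a real system would query a model here.
    return json.dumps({
        "make coffee": ["grind beans", "boil water", "pour water"],
        "clean up": ["wash cup", "wipe counter"],
    })

def build_hierarchy(atomic_actions):
    """Return {event_name: [atomic_action, ...]} parsed from the LLM reply."""
    reply = call_llm(build_prompt(atomic_actions))
    hierarchy = json.loads(reply)
    # Keep only actions that actually appear in the input annotation set,
    # discarding any labels the model hallucinated.
    return {
        event: [a for a in actions if a in atomic_actions]
        for event, actions in hierarchy.items()
    }

actions = ["grind beans", "boil water", "pour water", "wash cup", "wipe counter"]
print(build_hierarchy(actions))
```

The filtering step matters in practice: constraining the output to the dataset's existing label vocabulary is a simple way to keep LLM-generated hierarchies consistent with the underlying fine-grained annotations.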