Hier-EgoPack: Hierarchical Egocentric Video Understanding with Diverse Task Perspectives

📅 2025-02-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of multi-task, multi-temporal-granularity understanding of human activities in first-person videos. Methodologically, it introduces a unified modeling framework featuring a hierarchical temporal reasoning architecture that jointly leverages frame-level and clip-level representations. The framework comprises a hierarchical Transformer encoder, a task-adaptive graph neural network (GNN), and a multi-granularity feature alignment module, enabling joint learning of action recognition, ego-object interaction parsing, and short-term event prediction. Its key innovation lies in cross-granularity concept alignment and knowledge transfer mechanisms. Evaluated on multiple clip-level and frame-level benchmarks from Ego4d, the approach achieves state-of-the-art performance. Moreover, multi-task joint training improves few-shot generalization to novel tasks by 12.7%, demonstrating enhanced transferability and robustness.

Technology Category

Application Category

📝 Abstract
Our comprehension of video streams depicting human activities is naturally multifaceted: in just a few moments, we can grasp what is happening, identify the relevance and interactions of objects in the scene, and forecast what will happen soon, everything all at once. To endow autonomous systems with such a holistic perception, learning how to correlate concepts, abstract knowledge across diverse tasks, and leverage tasks synergies when learning novel skills is essential. A significant step in this direction is EgoPack, a unified framework for understanding human activities across diverse tasks with minimal overhead. EgoPack promotes information sharing and collaboration among downstream tasks, essential for efficiently learning new skills. In this paper, we introduce Hier-EgoPack, which advances EgoPack by enabling reasoning also across diverse temporal granularities, which expands its applicability to a broader range of downstream tasks. To achieve this, we propose a novel hierarchical architecture for temporal reasoning equipped with a GNN layer specifically designed to tackle the challenges of multi-granularity reasoning effectively. We evaluate our approach on multiple Ego4d benchmarks involving both clip-level and frame-level reasoning, demonstrating how our hierarchical unified architecture effectively solves these diverse tasks simultaneously.
Problem

Research questions and friction points this paper is trying to address.

Enhance video understanding across tasks.
Enable multi-granularity temporal reasoning.
Improve skill learning efficiency autonomously.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical architecture for temporal reasoning
GNN layer for multi-granularity reasoning
Unified framework for diverse task understanding
🔎 Similar Papers
No similar papers found.