HiERO: understanding the hierarchy of human behavior enhances reasoning on egocentric videos

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Understanding human behavior in egocentric videos remains challenging due to poor generalization—especially for unseen action compositions—stemming from weak semantic grounding and lack of structured behavioral priors. Method: We propose a weakly supervised learning framework that explicitly models implicit behavioral hierarchies (e.g., actions → substeps → goals) automatically discovered from unscripted, untrimmed videos. This hierarchy serves as a structural prior injected into video representations. Leveraging weak supervision from video-clip–narrative-text alignment, our approach jointly models contextual, semantic, and temporal hierarchical reasoning via multi-granularity behavioral thread embeddings and cross-modal feature enhancement. Contribution/Results: Our method achieves state-of-the-art performance on both alignment tasks (EgoMCQ, EgoNLQ) and zero-shot procedural understanding benchmarks (EgoProceL, Ego4D Goal-Step). Notably, it improves zero-shot F1 by 12.5% on EgoProceL, demonstrating significantly enhanced generalization to novel behavioral combinations.
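The paper itself does not publish its implementation here; purely as an illustration of the multi-granularity matching idea described above (pooling clip features into coarser "thread" features and scoring a narration embedding against every level of the hierarchy), a toy sketch might look like the following. All function names, the pooling windows, and the use of cosine similarity are assumptions for illustration, not HiERO's actual architecture.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def pool(features, window):
    """Average-pool consecutive clip features into coarser 'thread' features,
    standing in for one level of the action -> substep -> goal hierarchy."""
    pooled = []
    for i in range(0, len(features), window):
        chunk = features[i:i + window]
        pooled.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return pooled

def hierarchy_similarity(clip_feats, text_feat, windows=(1, 2, 4)):
    """Score a text embedding against each level of the clip hierarchy and
    return the best (similarity, window, segment index) match."""
    best = (-1.0, None, None)
    for w in windows:
        for idx, seg in enumerate(pool(clip_feats, w)):
            s = cosine(seg, text_feat)
            if s > best[0]:
                best = (s, w, idx)
    return best
```

For example, with four clip features where the last two match a narration embedding, the finest level (window 1) recovers the matching clip; coarser levels would dominate only when the narration describes a longer-horizon substep or goal.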

📝 Abstract
Human activities are particularly complex and variable, which makes it challenging for deep learning models to reason about them. However, we note that this variability has an underlying structure, composed of a hierarchy of patterns of related actions. We argue that such structure can emerge naturally from unscripted videos of human activities and can be leveraged to better reason about their content. We present HiERO, a weakly-supervised method to enrich video segment features with the corresponding hierarchical activity threads. By aligning video clips with their narrated descriptions, HiERO performs contextual, semantic, and temporal reasoning with a hierarchical architecture. We demonstrate the potential of our enriched features on multiple video-text alignment benchmarks (EgoMCQ, EgoNLQ) with minimal additional training, and in zero-shot on procedure learning tasks (EgoProceL and Ego4D Goal-Step). Notably, HiERO achieves state-of-the-art performance on all benchmarks, and on procedure learning it outperforms fully-supervised methods by a large margin (+12.5% F1 on EgoProceL) in zero-shot. Our results demonstrate the value of exploiting the hierarchy of human activities for multiple reasoning tasks in egocentric vision.
Problem

Research questions and friction points this paper is trying to address.

Understanding hierarchical human behavior in egocentric videos
Enhancing reasoning with weakly-supervised hierarchical activity threads
Improving video-text alignment and procedure learning performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical activity threads enrich video features
Weakly-supervised alignment with narrated descriptions
State-of-the-art zero-shot procedure learning performance