EgoAction: Egocentric Action Composition with Reliability-Aware Temporal Fusion for the EPIC-KITCHENS Action Detection Challenge at CVPR 2026

📅 2026-05-23

🤖 AI Summary

This work addresses temporal action detection in long, untrimmed first-person videos with a decoupled verb-noun detection and fusion framework. Separate temporal detectors with causal temporal modeling are trained for verbs and nouns, and action proposals are generated via sliding-window inference. A Dynamic Weighted Fusion (DWF) mechanism derives per-proposal boundary weights from each stream's classification confidence, so the more reliable stream has more influence on the fused start and end times. Combined with EPIC-finetuned VideoMAE-L features, class-wise Soft-NMS, and top-K noun-verb action composition, the system is designed to improve both boundary localization and label prediction on EPIC-KITCHENS-100 while remaining lightweight and reproducible.
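
To make the top-K combination of verb and noun predictions concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each stream yields a per-class confidence for a proposal and that a composed action is scored by the product of its verb and noun confidences; the scoring rule, the function names `top_k` and `compose_actions`, and the example labels are all hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

ActionLabel = Tuple[str, str]  # (verb, noun)


def top_k(scores: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (label, score) pairs, best first."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


def compose_actions(
    verb_scores: Dict[str, float],
    noun_scores: Dict[str, float],
    k_verb: int = 2,
    k_noun: int = 2,
) -> List[Tuple[ActionLabel, float]]:
    """Pair the top verbs with the top nouns for one proposal.

    Assumption of this sketch: the action score is the product of the
    verb and noun confidences. The source does not state the exact rule.
    """
    hypotheses = []
    top_verbs = top_k(verb_scores, k_verb)
    top_nouns = top_k(noun_scores, k_noun)
    for (verb, s_verb), (noun, s_noun) in product(top_verbs, top_nouns):
        hypotheses.append(((verb, noun), s_verb * s_noun))
    # Rank the composed actions so downstream NMS sees the best ones first.
    hypotheses.sort(key=lambda item: item[1], reverse=True)
    return hypotheses


# Hypothetical per-class confidences for a single proposal.
verbs = {"take": 0.6, "put": 0.3, "wash": 0.1}
nouns = {"plate": 0.5, "cup": 0.4, "knife": 0.1}
ranked = compose_actions(verbs, nouns)
```

With these inputs the sketch produces four hypotheses (two verbs times two nouns), and ("take", "plate") ranks first because both of its components are the top prediction of their stream.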

📝 Abstract

The EPIC-KITCHENS-100 Action Detection challenge evaluates whether a model can localize the start and end of each action in long untrimmed egocentric videos and assign the corresponding verb-noun action label. In this report, we formulate our submission as EgoAction (Egocentric Action Composition with Reliability-Aware Temporal Fusion), a unified decoupled detection and fusion pipeline. The pipeline uses EPIC-finetuned VideoMAE-L features, trains separate noun and verb temporal detectors with causal temporal modeling, composes action hypotheses from top noun-verb pairs, and introduces a confidence-adaptive boundary fusion rule at post-processing time. The key observation is that verb and noun streams often fail differently: verb scores are sensitive to motion transitions, whereas noun scores are sensitive to hand-object visibility and object clutter. A fixed arithmetic mean of their predicted boundaries can therefore amplify localization errors when one stream degenerates. We replace this hard-coded mean with Dynamic Weighted Fusion (DWF), which normalizes the maximum noun and verb classification confidences into proposal-wise boundary weights and linearly combines the two intervals. This lightweight tensor-only operator shifts boundary authority toward the more reliable stream while preserving the decoupled action scoring mechanism. Together with sliding-window inference, top-K noun-verb action composition, and class-wise Soft-NMS, EgoAction provides a compact and reproducible system for egocentric temporal action detection.
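
The DWF rule described in the abstract can be sketched for a single proposal as follows. This is an illustration under stated assumptions rather than the authors' code: "normalizes" is read here as dividing each stream's maximum confidence by the sum of the two, the fallback to a plain mean when both confidences are zero is added only to keep the sketch well defined, and the function name `dynamic_weighted_fusion` is hypothetical. The original operator is described as tensor-based; plain Python is used here for readability.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def dynamic_weighted_fusion(
    verb_interval: Interval,
    noun_interval: Interval,
    verb_scores: Sequence[float],
    noun_scores: Sequence[float],
) -> Interval:
    """Fuse the verb and noun boundaries of one proposal by confidence."""
    # Reliability proxy: the maximum class confidence of each stream.
    c_verb = max(verb_scores)
    c_noun = max(noun_scores)

    # Sum-normalize the two confidences into weights that add up to one
    # (assumed normalization; the source does not give the exact formula).
    total = c_verb + c_noun
    if total <= 0.0:
        w_verb = 0.5  # assumed fallback: revert to the arithmetic mean
    else:
        w_verb = c_verb / total
    w_noun = 1.0 - w_verb

    # Linearly combine the two intervals with the proposal-wise weights.
    start = w_verb * verb_interval[0] + w_noun * noun_interval[0]
    end = w_verb * verb_interval[1] + w_noun * noun_interval[1]
    return (start, end)


# Hypothetical proposal: the verb stream is confident, the noun stream is not.
fused = dynamic_weighted_fusion(
    verb_interval=(10.0, 14.0),
    noun_interval=(12.0, 18.0),
    verb_scores=[0.10, 0.75],
    noun_scores=[0.25, 0.05],
)
print(fused)
```

In this example the weights are 0.75 for the verb stream and 0.25 for the noun stream, so the fused interval is (10.5, 15.0). A fixed arithmetic mean would give (11.0, 16.0), which sits further from the boundaries of the more confident verb stream.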

Problem

Research questions and friction points this paper is trying to address.

egocentric video

temporal action detection

action localization

verb-noun action labeling

EPIC-KITCHENS

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Weighted Fusion

Egocentric Action Detection

Decoupled Verb-Noun Modeling