HOI-aware Adaptive Network for Weakly-supervised Action Segmentation

📅 2026-04-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the ambiguity between semantically similar actions, such as pouring juice versus pouring coffee, in weakly supervised action segmentation. To resolve it, the authors use video-level human-object interaction (HOI) as contextual prior knowledge. They introduce a temporally global yet spatially local HOI sequence, produced by a video HOI encoder that extracts, selects, and integrates the most representative HOIs in a video. They also design a dual-branch hypernetwork that generates the parameters of an adaptive temporal encoder, so the model adjusts its parameters at test time to the HOI structure of each input video. The authors report improvements over existing weakly supervised approaches on both the Breakfast and 50Salads datasets across multiple evaluation metrics.
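The phrase "dynamically adjusts model parameters at test time" can be made concrete with a minimal pure-Python sketch of a hypernetwork that generates the weights of a temporal convolution from a video-level HOI vector. This is not the AdaAct implementation. The split of the two branches (one emitting the convolution kernel, one emitting a per-channel bias), the single linear layer per branch, the sizes, and the names `TwoBranchHyperNet` and `temporal_conv` are all hypothetical choices made for illustration.

```python
import random
from typing import List, Tuple

Vector = List[float]
Kernel = List[List[List[float]]]  # C_out x C_in x K


def linear(x: Vector, w: List[Vector], b: Vector) -> Vector:
    """Dense layer y = W x + b, with W stored row by row."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


class TwoBranchHyperNet:
    """Hypothetical two-branch hypernetwork: maps a video-level HOI vector to
    the kernel (branch 1) and bias (branch 2) of a 1-D temporal convolution.
    The branch split is an illustrative assumption, not the paper's design."""

    def __init__(self, hoi_dim: int, channels: int, kernel_size: int, seed: int = 0):
        rng = random.Random(seed)
        self.c, self.k = channels, kernel_size
        n_kernel = channels * channels * kernel_size
        # Branch 1: HOI vector -> flattened conv kernel.
        self.w_kernel = [[rng.gauss(0.0, 0.1) for _ in range(hoi_dim)] for _ in range(n_kernel)]
        self.b_kernel = [0.0] * n_kernel
        # Branch 2: HOI vector -> per-output-channel bias.
        self.w_bias = [[rng.gauss(0.0, 0.1) for _ in range(hoi_dim)] for _ in range(channels)]
        self.b_bias = [0.0] * channels

    def __call__(self, hoi: Vector) -> Tuple[Kernel, Vector]:
        flat = linear(hoi, self.w_kernel, self.b_kernel)
        c, k = self.c, self.k
        # Unflatten: entry (o, i, j) lives at index (o * c + i) * k + j.
        kernel = [[flat[(o * c + i) * k:(o * c + i + 1) * k] for i in range(c)]
                  for o in range(c)]
        return kernel, linear(hoi, self.w_bias, self.b_bias)


def temporal_conv(frames: List[Vector], kernel: Kernel, bias: Vector) -> List[Vector]:
    """Zero-padded 'same' 1-D convolution over time (cross-correlation, as in
    most deep-learning libraries). frames is T x C_in; returns T x C_out."""
    t_len, k = len(frames), len(kernel[0][0])
    pad = k // 2
    out = []
    for t in range(t_len):
        row = []
        for o, b in enumerate(bias):
            acc = b
            for j in range(k):
                src = t + j - pad
                if 0 <= src < t_len:  # frames outside the video count as zeros
                    acc += sum(kernel[o][i][j] * x for i, x in enumerate(frames[src]))
            row.append(acc)
        out.append(row)
    return out


hyper = TwoBranchHyperNet(hoi_dim=4, channels=3, kernel_size=3)
frames = [[float(t + c) for c in range(3)] for t in range(5)]  # T=5 frames, C=3
# Same frames, two different video-level HOI vectors -> two different encoders.
out_a = temporal_conv(frames, *hyper([1.0, 0.0, 0.0, 0.0]))
out_b = temporal_conv(frames, *hyper([0.0, 1.0, 0.0, 0.0]))
```

Because the convolution weights are outputs of the hypernetwork rather than fixed parameters, two videos with different HOI vectors are processed by two different temporal encoders, while only the hypernetwork's own weights would be learned during training.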

📝 Abstract

In this paper, we propose an HOI-aware adaptive network named AdaAct for weakly-supervised action segmentation. Most existing methods learn a fixed network that predicts the action of each frame from its neighboring frames. However, this can result in ambiguity when estimating similar actions, such as pouring juice and pouring coffee. To address this, we exploit temporally global but spatially local human-object interactions (HOI) as video-level prior knowledge for action segmentation. The long-term HOI sequence provides crucial contextual information for distinguishing ambiguous actions, and our network dynamically adapts to the given HOI sequence at test time. More specifically, we first design a video HOI encoder that extracts, selects, and integrates the most representative HOIs throughout the video. Then, we propose a two-branch HyperNetwork to learn an adaptive temporal encoder, which adjusts its parameters on the fly based on the HOI information of each video. Extensive experiments on two widely used datasets, Breakfast and 50Salads, demonstrate the effectiveness of our method under different evaluation metrics.
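The extract, select, and integrate steps of the video HOI encoder can be sketched as follows, assuming HOI candidates have already been extracted per frame by some upstream detector. Ranking by a detector confidence `score`, keeping the top `k` over the whole video, and softmax-weighted averaging are stand-ins chosen for illustration. The abstract does not specify the actual selection or integration mechanism, and `HOICandidate` and `video_hoi_encoder` are hypothetical names.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class HOICandidate:
    frame: int              # frame index where the interaction was detected
    score: float            # hypothetical confidence from an upstream HOI detector
    feature: List[float]    # spatially local feature of one human-object pair


def select_top_k(cands: List[HOICandidate], k: int) -> List[HOICandidate]:
    """Keep the k highest-scoring HOIs across the whole video (temporally global)."""
    return sorted(cands, key=lambda c: c.score, reverse=True)[:k]


def integrate(selected: List[HOICandidate]) -> List[float]:
    """Softmax-weighted average of the selected features -> one video-level vector."""
    m = max(c.score for c in selected)  # subtract the max for numerical stability
    weights = [math.exp(c.score - m) for c in selected]
    total = sum(weights)
    dim = len(selected[0].feature)
    return [sum(w * c.feature[d] for w, c in zip(weights, selected)) / total
            for d in range(dim)]


def video_hoi_encoder(cands: List[HOICandidate], k: int = 3) -> Tuple[List[HOICandidate], List[float]]:
    chosen = select_top_k(cands, k)
    chosen.sort(key=lambda c: c.frame)  # restore temporal order of the kept HOIs
    return chosen, integrate(chosen)


cands = [
    HOICandidate(frame=0, score=0.9, feature=[1.0, 0.0]),
    HOICandidate(frame=4, score=0.2, feature=[0.0, 1.0]),
    HOICandidate(frame=9, score=0.9, feature=[0.0, 1.0]),
    HOICandidate(frame=12, score=0.1, feature=[5.0, 5.0]),
]
chosen, video_vec = video_hoi_encoder(cands, k=2)
```

Selecting over all frames rather than a local window is what makes the prior temporally global, while each candidate's feature stays local to one human-object pair. In a design like this, a vector such as `video_vec` would be the conditioning input to the hypernetwork.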

Problem

Research questions and friction points this paper is trying to address.

weakly-supervised action segmentation

human-object interaction

action ambiguity

temporal action segmentation

Innovation

Methods, ideas, or system contributions that make the work stand out.

HOI-aware

adaptive network

weakly-supervised action segmentation

HyperNetwork