🤖 AI Summary
This work addresses the limitations of existing video human-object interaction (HOI) understanding methods, which treat detection and future prediction as disjoint tasks and rely on sparse keyframe annotations, leading to temporal misalignment and evaluation bias. To overcome these issues, the authors propose HOI-DA, a novel framework that unifies current HOI detection and future prediction into a single joint learning task under structured constraints. Centered on human-object pairs, HOI-DA introduces a residual state transition mechanism and a set-based prediction model with temporal correction. The study also establishes DETAnt-HOI, the first temporally aligned benchmark featuring multi-granularity temporal annotations for comprehensive evaluation. Experiments show that the proposed method consistently improves over existing approaches in both detection and multi-step prediction, with the largest gains in long-horizon forecasting.
📝 Abstract
Video-based human-object interaction (HOI) understanding requires both detecting ongoing interactions and anticipating their future evolution. However, existing methods usually treat anticipation as a downstream forecasting task built on externally constructed human-object pairs, limiting joint reasoning between detection and prediction. In addition, sparse keyframe annotations in current benchmarks can leave nominal future labels temporally misaligned with actual future dynamics, reducing the reliability of anticipation evaluation. To address these issues, we introduce DETAnt-HOI, a temporally corrected benchmark derived from VidHOI and Action Genome for more faithful multi-horizon evaluation, and HOI-DA, a pair-centric framework that jointly performs subject-object localization, present HOI detection, and future anticipation by modeling future interactions as residual transitions from current pair states. Experiments show consistent improvements in both detection and anticipation, with larger gains at longer horizons. Our results highlight that anticipation is most effective when learned jointly with detection as a structural constraint on pair-level video representation learning. The benchmark and code will be made publicly available.
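As a rough illustration of what "residual transitions from current pair states" could mean, the sketch below predicts each future pair state as the current state plus a horizon-specific offset. It is a toy under stated assumptions, not the HOI-DA implementation: the function name `anticipate_residual`, the plain-list state representation, and the per-horizon residual callables are all hypothetical and do not come from the paper.

```python
from typing import Callable, Dict, List

Vector = List[float]


def anticipate_residual(
    current_state: Vector,
    residual_fns: Dict[int, Callable[[Vector], Vector]],
) -> Dict[int, Vector]:
    """Return one predicted pair state per horizon: current_state + residual.

    Hypothetical helper: each entry of residual_fns maps a horizon (in
    arbitrary time steps) to a function that estimates how far the pair
    state drifts from the present by that horizon.
    """
    futures: Dict[int, Vector] = {}
    for horizon, residual_fn in residual_fns.items():
        delta = residual_fn(current_state)
        if len(delta) != len(current_state):
            raise ValueError("residual must match the state dimension")
        # The future state is anchored to the present state, so the
        # residual only has to describe the change, not the full state.
        futures[horizon] = [s + d for s, d in zip(current_state, delta)]
    return futures


# Toy 3-dimensional pair state and two illustrative horizons.
state = [0.5, 1.0, -2.0]
toy_residuals = {
    1: lambda s: [0.0 for _ in s],      # no change predicted at horizon 1
    3: lambda s: [0.5 * x for x in s],  # larger drift at horizon 3
}
futures = anticipate_residual(state, toy_residuals)
```

In a learned model the residual functions would presumably be network heads that share the pair representation used for detection, which is one way the detection and anticipation objectives could constrain each other.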