Rethinking Video Human-Object Interaction: Set Prediction over Time for Unified Detection and Anticipation

📅 2026-04-11

🤖 AI Summary

This work addresses two limitations of existing video human-object interaction (HOI) understanding methods: they treat detection and anticipation as disjoint tasks, and they rely on sparse keyframe annotations, which leads to temporal misalignment and evaluation bias. To overcome these issues, the authors propose HOI-DA, a pair-centric framework that unifies present HOI detection and future anticipation into a single joint learning task, with anticipation acting as a structural constraint. HOI-DA formulates the task as set prediction over time and models future interactions as residual transitions from current human-object pair states. The study also establishes DETAnt-HOI, presented as the first temporally aligned benchmark, derived from VidHOI and Action Genome and featuring multi-granularity temporal annotations for multi-horizon evaluation. Experiments show that the proposed method consistently outperforms existing approaches in both detection and multi-step anticipation, with larger gains at longer horizons.

📝 Abstract

Video-based human-object interaction (HOI) understanding requires both detecting ongoing interactions and anticipating their future evolution. However, existing methods usually treat anticipation as a downstream forecasting task built on externally constructed human-object pairs, limiting joint reasoning between detection and prediction. In addition, sparse keyframe annotations in current benchmarks can leave nominal future labels temporally misaligned with actual future dynamics, reducing the reliability of anticipation evaluation. To address these issues, we introduce DETAnt-HOI, a temporally corrected benchmark derived from VidHOI and Action Genome for more faithful multi-horizon evaluation, and HOI-DA, a pair-centric framework that jointly performs subject-object localization, present HOI detection, and future anticipation by modeling future interactions as residual transitions from current pair states. Experiments show consistent improvements in both detection and anticipation, with larger gains at longer horizons. Our results highlight that anticipation is most effective when learned jointly with detection as a structural constraint on pair-level video representation learning. Benchmark and code will be publicly available.
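
The idea of "residual transitions from current pair states" can be illustrated with a minimal sketch. This is not the HOI-DA implementation, which the abstract does not specify. The function name `anticipate_residual`, the linear-plus-tanh transition, the shared classifier, and the sigmoid multi-label output are all assumptions chosen for illustration.

```python
import numpy as np


def anticipate_residual(pair_state, transition_weights, classifier_weights):
    """Score interactions for one human-object pair at the present and K future horizons.

    All shapes and layers here are hypothetical, not taken from the paper:
      pair_state:         (d,)      embedding of the current pair state
      transition_weights: (K, d, d) one linear map per anticipation horizon
      classifier_weights: (d, c)    classifier shared by present and future states
    Returns an array of shape (K + 1, c); row 0 is the present, row k is horizon k.
    """
    # Residual transition: each future state is the current state plus a
    # predicted change, rather than a state predicted from scratch.
    deltas = np.tanh(transition_weights @ pair_state)          # (K, d)
    future_states = pair_state + deltas                        # (K, d)

    # Detection (row 0) and anticipation (rows 1..K) share one classifier,
    # so both tasks are tied to the same pair-level representation.
    states = np.vstack([pair_state[None, :], future_states])   # (K + 1, d)
    logits = states @ classifier_weights                       # (K + 1, c)

    # Sigmoid scores, assuming interactions are multi-label per pair.
    return 1.0 / (1.0 + np.exp(-logits))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    state = rng.normal(size=8)
    transitions = 0.1 * rng.normal(size=(3, 8, 8))
    classifier = rng.normal(size=(8, 5))
    scores = anticipate_residual(state, transitions, classifier)
    print(scores.shape)
```

One property of this formulation is that a zero transition reproduces the present prediction at every horizon, so the model only has to learn how interactions change over time.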

Problem

Research questions and friction points this paper is trying to address.

human-object interaction

video understanding

anticipation

detection

temporal alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

human-object interaction

temporal alignment

joint detection and anticipation