🤖 AI Summary
This work addresses the challenge of jointly modeling fine-grained human–object interactions (HOIs) and the spatiotemporal trajectories of humans and objects in videos. To this end, we formulate a novel task: instance-level spatiotemporal HOI detection (ST-HOID). To support this task, we introduce VidOR-HOID, the first large-scale benchmark for ST-HOID, containing 10,831 annotated spatiotemporal HOI instances. Methodologically, we propose a dual-module framework that integrates object trajectory detection with interaction reasoning, combining instance-level temporal modeling and relational inference. Extensive experiments demonstrate that our approach achieves significant improvements over baselines built from state-of-the-art methods, including image-based HOI detectors, video visual relation detection models, and prior video HOI recognition systems. This work advances fine-grained, human-centric video understanding by unifying trajectory tracking and interaction semantics at the instance level.
📝 Abstract
In this paper, we propose a new instance-level human-object interaction detection task on videos, called ST-HOID, which aims to detect fine-grained human-object interactions (HOIs) together with the trajectories of the interacting subjects and objects. The task is motivated by the fact that HOIs are crucial for human-centric video content understanding. To solve ST-HOID, we propose a novel method consisting of an object trajectory detection module and an interaction reasoning module. Furthermore, we construct the first dataset for ST-HOID evaluation, named VidOR-HOID, which contains 10,831 spatiotemporal HOI instances. We conduct extensive experiments to evaluate the effectiveness of our method. The results demonstrate that our method outperforms baselines built from state-of-the-art methods for image-based human-object interaction detection, video visual relation detection, and video human-object interaction recognition.