Spatial-Temporal Human-Object Interaction Detection

📅 2025-08-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the joint modeling of fine-grained human-object interactions (HOIs) and the spatial-temporal trajectories of humans and objects in videos. It formulates a new task, instance-level spatial-temporal HOI detection (ST-HOID), and introduces VidOR-HOID, described as the first dataset for evaluating this task, with 10,831 annotated spatial-temporal HOI instances. The proposed method combines an object trajectory detection module with an interaction reasoning module. In the reported experiments, the method outperforms baselines built from state-of-the-art methods for image HOI detection, video visual relation detection, and video HOI recognition.

📝 Abstract
In this paper, we propose ST-HOID, a new instance-level human-object interaction detection task in videos, which aims to distinguish fine-grained human-object interactions (HOIs) and the trajectories of subjects and objects. It is motivated by the fact that HOI is crucial for human-centric video content understanding. To solve ST-HOID, we propose a novel method consisting of an object trajectory detection module and an interaction reasoning module. Furthermore, we construct VidOR-HOID, the first dataset for ST-HOID evaluation, which contains 10,831 spatial-temporal HOI instances. We conduct extensive experiments to evaluate the effectiveness of our method. The experimental results demonstrate that our method outperforms baselines built from state-of-the-art methods for image human-object interaction detection, video visual relation detection, and video human-object interaction recognition.
Problem

Research questions and friction points this paper is trying to address.

Detecting fine-grained human-object interactions in videos
Tracking the spatial-temporal trajectories of subjects and objects
Advancing human-centric video content understanding through HOI analysis
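
To make the task output concrete, the sketch below shows one way a spatial-temporal HOI instance could be represented: a subject trajectory, an interaction label, and an object trajectory. The class names (`Trajectory`, `STHOIInstance`) and the `trajectory_iou` helper are illustrative assumptions, not taken from the paper. The overlap measure is a volumetric IoU of the kind commonly used to compare box trajectories in video tasks; the summary does not state which matching criterion VidOR-HOID actually uses.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class Trajectory:
    """Hypothetical container: one tracked entity as frame index -> box."""
    category: str
    boxes: Dict[int, Box]


@dataclass
class STHOIInstance:
    """Hypothetical container: <subject trajectory, interaction, object trajectory>."""
    subject: Trajectory
    interaction: str
    obj: Trajectory


def box_area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def box_intersection(a: Box, b: Box) -> float:
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)


def trajectory_iou(a: Trajectory, b: Trajectory) -> float:
    """Volumetric IoU: per-frame intersections and unions summed over all frames.

    A frame covered by only one trajectory adds that box's area to the union.
    """
    inter = 0.0
    union = 0.0
    for f in set(a.boxes) | set(b.boxes):
        if f in a.boxes and f in b.boxes:
            i = box_intersection(a.boxes[f], b.boxes[f])
            inter += i
            union += box_area(a.boxes[f]) + box_area(b.boxes[f]) - i
        elif f in a.boxes:
            union += box_area(a.boxes[f])
        else:
            union += box_area(b.boxes[f])
    return inter / union if union > 0 else 0.0
```

Under this representation, a predicted instance would plausibly count as correct only when its interaction label matches and both its subject and object trajectories overlap the ground truth sufficiently; the exact threshold is an evaluation choice not given here.
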
Innovation

Methods, ideas, or system contributions that make the work stand out.

Object trajectory detection module
Interaction reasoning module for HOIs
VidOR-HOID dataset for evaluation
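
The two modules suggest a two-stage flow: detect trajectories first, then reason about interactions over human-object pairs. The sketch below illustrates only that general flow and is not the authors' architecture. The function name `detect_interactions`, the `"person"` category string, the `min_overlap` parameter, and the `classify` callback (a stand-in for an interaction reasoning model) are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Track = Dict[int, Box]                   # frame index -> box


def detect_interactions(
    tracks: List[Tuple[str, Track]],
    classify: Callable[[Track, Track, str], List[str]],
    min_overlap: int = 1,
) -> List[Tuple[int, str, int, Tuple[int, int]]]:
    """Pair each person track with every other track that shares frames with it.

    `tracks` stands in for the output of a trajectory detection stage.
    `classify` is a hypothetical interaction model returning predicted labels.
    Returns (subject index, label, object index, (first, last shared frame)).
    """
    results = []
    for si, (s_cat, s_track) in enumerate(tracks):
        if s_cat != "person":
            continue  # only humans act as subjects
        for oi, (o_cat, o_track) in enumerate(tracks):
            if oi == si:
                continue
            shared = sorted(set(s_track) & set(o_track))
            if len(shared) < min_overlap:
                continue  # no temporal co-occurrence, so no candidate pair
            # Clip both tracks to their shared frames before classification.
            s_clip = {f: s_track[f] for f in shared}
            o_clip = {f: o_track[f] for f in shared}
            for label in classify(s_clip, o_clip, o_cat):
                # Simplification: report the span even if shared frames have gaps.
                results.append((si, label, oi, (shared[0], shared[-1])))
    return results
```

With a stub classifier that returns `["hold"]` for cups and nothing otherwise, a person track and a cup track sharing frames 1 and 2 would yield a single `hold` instance spanning those frames.
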
🔎 Similar Papers
2024-08-20
2024 2nd International Conference on Computer, Vision and Intelligent Technology (ICCVIT)
Citations: 2
Xu Sun
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Yunqing He
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Tongwei Ren
Nanjing University
multimedia computing
Gangshan Wu
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China