🤖 AI Summary
This work addresses the challenge of jointly modeling fine-grained human–object interactions (HOIs) and the spatiotemporal trajectories of humans and objects in videos. To this end, we formulate a novel task: instance-level spatiotemporal HOI detection (ST-HOID). To support this task, we introduce VidOR-HOID, the first large-scale benchmark for ST-HOID, containing 10,831 annotated spatiotemporal HOI instances. Methodologically, we propose a dual-module framework that integrates object trajectory detection with interaction reasoning, combining instance-level temporal modeling and relational inference. Extensive experiments demonstrate that our approach achieves significant improvements over baselines built from state-of-the-art methods, including image-based HOI detectors, video visual relation detection models, and prior video HOI recognition systems. This work advances fine-grained, human-centric video understanding by unifying trajectory tracking and interaction semantics at the instance level.
📝 Abstract
In this paper, we propose a new instance-level human-object interaction detection task on videos, called ST-HOID, which aims to detect fine-grained human-object interactions (HOIs) together with the trajectories of the interacting subjects and objects. The task is motivated by the fact that HOIs are crucial for human-centric video content understanding. To solve ST-HOID, we propose a novel method consisting of an object trajectory detection module and an interaction reasoning module. Furthermore, we construct the first dataset for ST-HOID evaluation, named VidOR-HOID, which contains 10,831 spatiotemporal HOI instances. We conduct extensive experiments to evaluate the effectiveness of our method. The results demonstrate that our method outperforms baselines built from state-of-the-art methods for image-based human-object interaction detection, video visual relation detection, and video human-object interaction recognition.