Explicit Multimodal Graph Modeling for Human-Object Interaction Detection

📅 2025-09-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing Transformer-based methods for human-object interaction (HOI) detection lack explicit relational modeling, limiting their ability to accurately recognize complex interactions. Method: This paper proposes a multimodal graph modeling paradigm tailored for HOI detection. It constructs a four-stage graph structure (comprising human, object, interaction, and scene nodes) and integrates hierarchical visual features with linguistic semantics. A graph neural network (GNN) drives cross-modal feature interaction and information propagation across the graph. Linguistic priors are incorporated into the graph reasoning process to balance recognition of rare and frequent interaction classes. Results: The method achieves state-of-the-art performance on the HICO-DET and V-COCO benchmarks. When integrated with a more advanced object detector, it yields a significant performance gain while maintaining a balance between rare and non-rare interaction classes.

📝 Abstract
Transformer-based methods have recently become the prevailing approach for Human-Object Interaction (HOI) detection. However, the Transformer architecture does not explicitly model the relational structures inherent in HOI detection, which impedes the recognition of interactions. In contrast, Graph Neural Networks (GNNs) are inherently better suited for this task, as they explicitly model the relationships between human-object pairs. Therefore, in this paper, we propose Multimodal Graph Network Modeling (MGNM), which leverages GNN-based relational structures to enhance HOI detection. Specifically, we design a multimodal graph network framework that explicitly models the HOI task in a four-stage graph structure. Furthermore, we introduce a multi-level feature interaction mechanism within our graph network. This mechanism leverages multi-level vision and language features to enhance information propagation across human-object pairs. Consequently, our proposed MGNM achieves state-of-the-art performance on two widely used benchmarks: HICO-DET and V-COCO. Moreover, when integrated with a more advanced object detector, our method demonstrates a significant performance gain and maintains an effective balance between rare and non-rare classes.
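
To make the idea of explicit relational modeling concrete, the following is a minimal, generic sketch of message passing on a bipartite human-object graph, followed by scoring each pair against verb embeddings that stand in for language features. It is not the MGNM architecture, whose details the abstract does not give. The function names (`message_pass`, `score_pairs`), the attention-style weighting, the additive pair feature, and the toy vectors are all assumptions chosen for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def message_pass(targets, sources):
    """One round of message passing from source nodes to target nodes.

    Hypothetical scheme: each target receives a softmax-weighted mean of
    the source features (weights from dot-product similarity) and adds
    that message to its own feature.
    """
    updated = []
    for t in targets:
        weights = softmax([dot(t, s) for s in sources])
        msg = [sum(w * s[i] for w, s in zip(weights, sources))
               for i in range(len(t))]
        updated.append([ti + mi for ti, mi in zip(t, msg)])
    return updated


def score_pairs(humans, objects, verb_embeddings):
    """Score every human-object edge against each verb embedding."""
    h_new = message_pass(humans, objects)   # objects -> humans
    o_new = message_pass(objects, humans)   # humans -> objects
    scores = {}
    for hi, h in enumerate(h_new):
        for oi, o in enumerate(o_new):
            # Assumed edge feature: element-wise sum of the two endpoints.
            pair = [a + b for a, b in zip(h, o)]
            scores[(hi, oi)] = {v: dot(pair, e)
                                for v, e in verb_embeddings.items()}
    return scores


# Toy inputs: two human nodes, one object node, two verb embeddings.
humans = [[1.0, 0.0], [0.0, 1.0]]
objects = [[1.0, 1.0]]
verbs = {"hold": [1.0, 0.0], "ride": [0.0, 1.0]}
scores = score_pairs(humans, objects, verbs)
```

Here every human-object pair is an explicit edge with its own score dictionary, which is the structural property the abstract contrasts with Transformer-based approaches. A real system would use learned projections and detector features in place of the fixed toy vectors.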
Problem

Research questions and friction points this paper is trying to address.

Explicitly modeling relational structures for HOI detection
Leveraging multimodal graph networks for interaction recognition
Enhancing information propagation across human-object pairs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph Neural Networks for HOI detection
Multimodal graph network with four-stage structure
Multi-level feature interaction mechanism
Wenxuan Ji
Institute of Information Engineering, Chinese Academy of Sciences
Haichao Shi
Institute of Information Engineering, Chinese Academy of Sciences
Xiao-Yu Zhang
Institute of Information Engineering, Chinese Academy of Sciences