An Image-like Diffusion Method for Human-Object Interaction Detection

📅 2025-03-23
🤖 AI Summary
HOI detection suffers from severe semantic ambiguity caused by variable human poses, diverse object appearances, occlusion, and cluttered backgrounds. This paper introduces a diffusion-based paradigm for HOI detection that models the detection output as a semantic image, a formulation presented as a first in the field. The approach comprises three core contributions: (1) a customized diffusion process and scheduling strategy designed to capture the hierarchical semantic structure of HOI triplets; (2) a slice-based patchification network architecture that jointly encodes local interaction cues and global contextual semantics; and (3) an image-space representation of HOI outputs, enabling end-to-end generative learning. Evaluated on HICO-DET and V-COCO, the method is reported to achieve substantial performance gains over prior state-of-the-art methods. These results suggest that generative image modeling is effective and generalizable for structured visual understanding tasks beyond traditional discriminative frameworks.

📝 Abstract
Human-object interaction (HOI) detection often faces high levels of ambiguity and indeterminacy, as the same interaction can appear vastly different across different human-object pairs. Additionally, the indeterminacy can be further exacerbated by issues such as occlusions and cluttered backgrounds. To handle such a challenging task, in this work, we begin with a key observation: the output of HOI detection for each human-object pair can be recast as an image. Thus, inspired by the strong image generation capabilities of image diffusion models, we propose a new framework, HOI-IDiff. In HOI-IDiff, we tackle HOI detection from a novel perspective, using an Image-like Diffusion process to generate HOI detection outputs as images. Furthermore, recognizing that our recast images differ in certain properties from natural images, we enhance our framework with a customized HOI diffusion process and a slice patchification model architecture, which are specifically tailored to generate our recast "HOI images". Extensive experiments demonstrate the efficacy of our framework.
Problem

Research questions and friction points this paper is trying to address.

Detect human-object interactions despite visual ambiguity and variability
Address occlusion and cluttered background challenges in HOI detection
Generate HOI detection outputs as images using diffusion models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Image-like Diffusion for HOI detection
Customized HOI diffusion process
Slice patchification model architecture
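
To make the image-like framing concrete, the sketch below lays a hypothetical list of per-pair interaction scores out as a small 2D grid and noises it with the standard DDPM forward process. This is only an illustration under stated assumptions: the grid layout, the function names, and the linear beta schedule are not taken from the paper, and the standard forward process stands in for the customized HOI diffusion process and slice patchification, whose details are not given here.

```python
import math
import random


def recast_as_image(scores, height, width):
    """Lay a flat list of scores out as a height x width grid.

    Hypothetical layout for illustration only; zero-pads unused cells.
    """
    if len(scores) > height * width:
        raise ValueError("too many scores for the requested grid")
    padded = list(scores) + [0.0] * (height * width - len(scores))
    return [padded[r * width:(r + 1) * width] for r in range(height)]


def linear_alpha_bar(t, num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 1..t, linear beta schedule.

    Assumed schedule (standard DDPM default), not the paper's.
    """
    alpha_bar = 1.0
    for s in range(1, t + 1):
        frac = (s - 1) / max(num_steps - 1, 1)
        beta = beta_start + (beta_end - beta_start) * frac
        alpha_bar *= 1.0 - beta
    return alpha_bar


def forward_noise(image, t, num_steps, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)."""
    abar = linear_alpha_bar(t, num_steps)
    scale = math.sqrt(abar)
    sigma = math.sqrt(1.0 - abar)
    return [[scale * v + sigma * rng.gauss(0.0, 1.0) for v in row]
            for row in image]


# Usage: five made-up interaction scores for one human-object pair.
rng = random.Random(0)
x0 = recast_as_image([1.0, 0.0, 0.0, 1.0, 0.0], height=2, width=3)
xt = forward_noise(x0, t=500, num_steps=1000, rng=rng)
```

In a full pipeline, a denoising network would be trained to recover `x0` from `xt` conditioned on image features of the human-object pair; that network is omitted because the source does not describe it.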