🤖 AI Summary
HOI detection suffers from severe semantic ambiguity caused by variable human poses, diverse object appearances, occlusion, and cluttered backgrounds. This paper introduces a diffusion-based paradigm for HOI detection that models the detection output as a semantic image, a first in the field. The approach comprises three core contributions: (1) a customized diffusion process and scheduling strategy explicitly designed to capture the hierarchical semantic structure of HOI triplets; (2) a slice-based patchification network architecture that jointly encodes local interaction cues and global contextual semantics; and (3) an image-space representation of HOI outputs, enabling end-to-end generative learning. Evaluated on HICO-DET and V-COCO, the method achieves substantial performance gains over prior state-of-the-art methods. These results demonstrate the effectiveness and generalizability of generative image modeling for structured visual understanding tasks beyond traditional discriminative frameworks.
📝 Abstract
Human-object interaction (HOI) detection often faces high levels of ambiguity and indeterminacy, as the same interaction can appear vastly different across human-object pairs. This indeterminacy is exacerbated by issues such as occlusions and cluttered backgrounds. To handle this challenging task, we begin with a key observation: the output of HOI detection for each human-object pair can be recast as an image. Thus, inspired by the strong generation capabilities of image diffusion models, we propose a new framework, HOI-IDiff. In HOI-IDiff, we tackle HOI detection from a novel perspective, using an Image-like Diffusion process to generate HOI detection outputs as images. Furthermore, recognizing that our recast images differ in certain properties from natural images, we enhance our framework with a customized HOI diffusion process and a slice patchification model architecture, both specifically tailored to generate our recast "HOI images". Extensive experiments demonstrate the efficacy of our framework.
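To make the idea of treating a detection output as an image more concrete, the following is a minimal sketch and not the HOI-IDiff implementation. The abstract does not specify how outputs are recast, so the grid layout in `encode_hoi_as_image` (rows as verbs, columns as object classes) is purely an illustrative assumption. The noising step uses the standard DDPM forward process with a linear beta schedule, not the customized HOI diffusion process the paper proposes. All function names are hypothetical.

```python
import math
import random


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule
    (the standard DDPM choice, assumed here for illustration)."""
    alpha_bars = []
    prod = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def encode_hoi_as_image(verb_scores, object_scores):
    """Hypothetical recasting of one human-object pair's output as a 2D grid.

    Rows index verbs, columns index object classes; each pixel is the
    product of the two scores, rescaled from [0, 1] to [-1, 1].
    This layout is an assumption, not the paper's definition.
    """
    return [[2.0 * v * o - 1.0 for o in object_scores] for v in verb_scores]


def forward_noise(image, alpha_bar_t, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)."""
    signal = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [[signal * px + sigma * rng.gauss(0.0, 1.0) for px in row]
            for row in image]


rng = random.Random(0)
alpha_bars = linear_alpha_bar(1000)
hoi_image = encode_hoi_as_image([0.9, 0.1, 0.0], [0.8, 0.2])  # 3 verbs x 2 objects
noisy = forward_noise(hoi_image, alpha_bars[500], rng)
```

In a generative detector of this kind, a network would be trained to reverse the noising, so that sampling starts from noise and ends at a clean grid that can be decoded back into interaction scores. How HOI-IDiff structures that grid and schedules the noise is described in the paper itself.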