🤖 AI Summary
Existing image-guided composition methods struggle to simultaneously ensure interaction plausibility and appearance consistency in human-object interaction (HOI) scenarios. To address this, we propose HOComp—the first HOI-aware compositional generation framework—and introduce the IHOC benchmark dataset. Methodologically, HOComp integrates MLLMs-driven Region-based Pose Guidance (MRPG), which provides coarse-to-fine pose constraints and enforces fine-grained pose constraints via human pose landmarks, with a Detail-Consistent Appearance Preservation (DCAP) mechanism combining shape-aware attention modulation, a multi-view appearance loss, and a background consistency loss. Extensive experiments on IHOC demonstrate that HOComp significantly improves both the action plausibility and the visual realism of generated compositions. In both quantitative evaluations and qualitative analyses, HOComp consistently outperforms state-of-the-art methods, validating its effectiveness in synthesizing semantically coherent and visually faithful human-object interactions.
📝 Abstract
Existing image-guided composition methods can insert a foreground object onto a user-specified region of a background image, achieving natural blending inside the region while leaving the rest of the image unchanged. However, we observe that these methods often struggle to synthesize seamless, interaction-aware compositions when the task involves human-object interactions. In this paper, we first propose HOComp, a novel approach for compositing a foreground object onto a human-centric background image, while ensuring harmonious interactions between the foreground object and the background person as well as their consistent appearances. Our approach includes two key designs: (1) MLLMs-driven Region-based Pose Guidance (MRPG), which utilizes MLLMs to identify the interaction region as well as the interaction type (e.g., holding and lifting) to provide coarse-to-fine constraints on the generated pose for the interaction, while incorporating human pose landmarks to track action variations and enforce fine-grained pose constraints; and (2) Detail-Consistent Appearance Preservation (DCAP), which unifies a shape-aware attention modulation mechanism, a multi-view appearance loss, and a background consistency loss to ensure consistent shapes/textures of the foreground and faithful reproduction of the background human. We then propose the first dataset for the task, named Interaction-aware Human-Object Composition (IHOC). Experimental results on our dataset show that HOComp effectively generates harmonious human-object interactions with consistent appearances, and outperforms relevant methods qualitatively and quantitatively.
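As a rough illustration of how DCAP's objectives could be combined into a single training loss, here is a minimal sketch. The function names, the simple MSE terms, and the weights `w_app` and `w_bg` are assumptions for illustration only, not the paper's actual implementation:

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two arrays of matching shape."""
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

def dcap_total_loss(gen_views, ref_views, gen_bg, ref_bg,
                    w_app=1.0, w_bg=1.0):
    """Hedged sketch of a DCAP-style objective (weights are assumptions):
    - multi-view appearance term: the composited foreground rendered from
      several views is compared against reference views of the object;
    - background consistency term: the generated background is compared
      against the original background image.
    """
    # Average the appearance error over all available views.
    app = float(np.mean([mse(g, r) for g, r in zip(gen_views, ref_views)]))
    # Penalize any deviation of the background from the input image.
    bg = mse(gen_bg, ref_bg)
    return w_app * app + w_bg * bg
```

In practice such terms would be computed on deep features rather than raw pixels, but the weighted-sum structure above conveys how the appearance and background objectives jointly constrain generation.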