🤖 AI Summary
Existing diffusion-based methods for amodal completion of occluded objects in human-object interaction (HOI) scenes model interaction structure and physical constraints inadequately, producing geometrically implausible and semantically inconsistent reconstructions of invisible regions. To address this, we propose a multi-region diffusion inpainting framework that integrates human topological priors and contact-aware guidance. Specifically, we first partition occluded regions into primary and secondary areas using a contact heatmap; then, during denoising, we impose human skeletal constraints and physics-informed contact guidance to enhance structural understanding of HOI. Our method requires no ground-truth contact annotations and significantly improves completion plausibility and robustness, outperforming state-of-the-art approaches on multiple HOI benchmarks. Moreover, the generated amodal completions directly support downstream 3D reconstruction and novel-view synthesis tasks.
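The region partitioning described above can be sketched as a simple thresholding of the contact heatmap over the occluded mask. This is a hypothetical illustration, not the paper's implementation: the function name, the 0.5 threshold, and the boolean-mask representation are all assumptions.

```python
import numpy as np

def partition_occlusion(occluded_mask: np.ndarray,
                        contact_heatmap: np.ndarray,
                        threshold: float = 0.5):
    """Split an occluded region by human-object contact likelihood.

    occluded_mask   : bool array (H, W), True where the object is occluded.
    contact_heatmap : float array (H, W) in [0, 1], contact likelihood.
    Returns (primary, secondary) boolean masks that partition occluded_mask.
    """
    # Primary region: occluded pixels where contact (hence the hidden
    # object part) is most likely.
    primary = occluded_mask & (contact_heatmap >= threshold)
    # Secondary region: the remaining occluded pixels, where occlusion
    # by the interacting human is less probable.
    secondary = occluded_mask & ~primary
    return primary, secondary

# Toy example on a 2x2 image patch.
mask = np.array([[True, True], [True, False]])
heat = np.array([[0.9, 0.2], [0.6, 0.8]])
primary, secondary = partition_occlusion(mask, heat)
```

Each occluded pixel lands in exactly one of the two regions, so the downstream denoiser can treat them with separate strategies.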
📝 Abstract
Amodal completion, the task of inferring the full appearance of objects despite partial occlusion, is crucial for understanding complex human-object interactions (HOI) in computer vision and robotics. Existing methods, including those built on pre-trained diffusion models, often struggle to generate plausible completions in dynamic scenarios because they have a limited understanding of HOI. To address this, we develop a new approach that combines physical prior knowledge with a multi-regional inpainting technique specialized for HOI. By incorporating physical constraints from human topology and contact information, we define two distinct regions: the primary region, where occluded object parts are most likely to be, and the secondary region, where occlusions are less probable. Our multi-regional inpainting method applies customized denoising strategies to these regions within a diffusion model, improving both the shape accuracy and the visual realism of the generated completions. Experiments show that our approach significantly outperforms existing methods in HOI scenarios, moving machine perception closer to a human-like understanding of dynamic environments. We also show that our pipeline remains robust without ground-truth contact annotations, broadening its applicability to tasks such as 3D reconstruction and novel view/pose synthesis.
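One plausible way to realize "customized denoising strategies across these regions" is to vary the guidance strength per region when combining conditional and unconditional noise estimates, classifier-free-guidance style. The sketch below is an illustrative assumption, not the paper's actual mechanism; the function name and the weight values are invented for the example.

```python
import numpy as np

def region_weighted_noise(eps_uncond: np.ndarray,
                          eps_cond: np.ndarray,
                          primary: np.ndarray,
                          secondary: np.ndarray,
                          w_primary: float = 7.5,
                          w_secondary: float = 2.0) -> np.ndarray:
    """Blend conditional/unconditional noise estimates with a per-pixel
    guidance weight: strong guidance in the primary region (hidden object
    parts likely), weaker guidance in the secondary region, and no extra
    guidance (w = 1) elsewhere."""
    w = np.where(primary, w_primary,
                 np.where(secondary, w_secondary, 1.0))
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy example: unconditional estimate 0, conditional estimate 1,
# so the output equals the per-pixel guidance weight.
eps_u = np.zeros((1, 3))
eps_c = np.ones((1, 3))
primary = np.array([[True, False, False]])
secondary = np.array([[False, True, False]])
eps = region_weighted_noise(eps_u, eps_c, primary, secondary)
```

In a full sampler this blended estimate would replace the single guided estimate at every denoising step, letting the model spend more conditioning capacity where the occluded object is expected.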