🤖 AI Summary
Existing methods for inserting human images into scenes struggle with foreground occlusion handling, often placing subjects atop the scene’s frontmost layer and exhibiting limited pose controllability. This paper proposes an occlusion-aware, depth-consistent compositing framework, introducing two novel paradigms: (i) a two-stage synthesis with explicit depth supervision, and (ii) an end-to-end synthesis with implicit occlusion learning. Built upon latent diffusion models, our approach jointly leverages SMPL-driven 3D human pose estimation and scene depth prediction to achieve mask-free, geometrically consistent occlusion-aware compositing. Unlike prior work, it explicitly models depth ordering between the subject and background, enabling physically plausible foreground–background interactions. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches in both qualitative and quantitative evaluations. It faithfully realizes user-specified 3D poses while preserving scene depth continuity and ensuring semantically and geometrically valid occlusion relationships.
📝 Abstract
Compositing human figures into scene images has broad applications in areas such as entertainment and advertising. However, existing methods often cannot handle occlusion of the inserted person by foreground objects and unnaturally place the person in the frontmost layer. Moreover, they offer limited control over the inserted person's pose. To address these challenges, we propose two methods. Both allow explicit pose control via a 3D body model and leverage latent diffusion models to synthesize the person at a contextually appropriate depth, naturally handling occlusions without requiring occlusion masks. The first is a two-stage approach: the model first learns, via supervised learning, a depth map of the scene containing the inserted person, and then synthesizes the person accordingly. The second method learns occlusion implicitly and synthesizes the person directly from the input without explicit depth supervision. Quantitative and qualitative evaluations show that both methods outperform existing approaches, better preserving scene consistency while accurately reflecting occlusions and user-specified poses.
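The core idea behind depth-ordered compositing can be illustrated with a minimal sketch. The snippet below is not the paper's learned pipeline (which synthesizes the person with a latent diffusion model); it only demonstrates the underlying geometric rule the methods aim to respect: a person pixel should be visible only where the person is closer to the camera than the scene surface. The function name and the per-pixel depth comparison are illustrative assumptions.

```python
import numpy as np

def composite_with_depth(scene_rgb, scene_depth, person_rgb, person_depth, person_mask):
    """Depth-ordered compositing (illustrative stand-in for the learned synthesis).

    The person is visible only where the mask is on AND the person's depth is
    smaller (closer to the camera) than the scene's depth at that pixel.
    """
    visible = person_mask & (person_depth < scene_depth)
    out = scene_rgb.copy()
    out[visible] = person_rgb[visible]
    return out

# Toy 4x4 scene: left half is a near foreground occluder (depth 1), right half is far (depth 10).
scene_rgb = np.zeros((4, 4, 3), dtype=np.uint8)
scene_depth = np.full((4, 4), 10.0)
scene_depth[:, :2] = 1.0  # foreground object on the left

# Hypothetical "synthesized" person covering the frame at depth 5.
person_rgb = np.full((4, 4, 3), 255, dtype=np.uint8)
person_depth = np.full((4, 4), 5.0)
person_mask = np.ones((4, 4), dtype=bool)

result = composite_with_depth(scene_rgb, scene_depth, person_rgb, person_depth, person_mask)
# The person appears only on the right half; the near object occludes the left half.
```

A naive method without depth reasoning would paste the person over the occluder as well; the depth comparison is what both proposed methods learn to respect, explicitly (two-stage, with depth supervision) or implicitly (end-to-end).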