🤖 AI Summary
Existing methods for inserting human images into scenes struggle with foreground occlusion handling, often placing subjects atop the scene’s frontmost layer and exhibiting limited pose controllability. This paper proposes an occlusion-aware, depth-consistent compositing framework, introducing two novel paradigms: (i) a two-stage synthesis with explicit depth supervision, and (ii) an end-to-end synthesis with implicit occlusion learning. Built upon latent diffusion models, our approach jointly leverages SMPL-driven 3D human pose estimation and scene depth prediction to achieve mask-free, geometrically consistent occlusion-aware compositing. Unlike prior work, it explicitly models depth ordering between the subject and background, enabling physically plausible foreground–background interactions. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches in both qualitative and quantitative evaluations. It faithfully realizes user-specified 3D poses while preserving scene depth continuity and ensuring semantically and geometrically valid occlusion relationships.
📝 Abstract
Compositing human figures into scene images has broad applications in areas such as entertainment and advertising. However, existing methods often cannot handle occlusion of the inserted person by foreground objects and unnaturally place the person in the frontmost layer. Moreover, they offer limited control over the inserted person's pose. To address these challenges, we propose two methods. Both allow explicit pose control via a 3D body model and leverage latent diffusion models to synthesize the person at a contextually appropriate depth, naturally handling occlusions without requiring occlusion masks. The first is a two-stage approach: the model first learns, via supervised learning, a depth map of the scene containing the inserted person, and then synthesizes the person accordingly. The second method learns occlusion implicitly and synthesizes the person directly from the input without explicit depth supervision. Quantitative and qualitative evaluations show that both methods outperform existing approaches, better preserving scene consistency while accurately reflecting occlusions and user-specified poses.
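The core idea behind depth-ordered compositing can be illustrated with a minimal sketch. The snippet below is not the paper's learned pipeline (which synthesizes the person with a latent diffusion model); it only demonstrates the underlying geometric rule the methods aim to respect: a person pixel should be visible only where the person is closer to the camera than the scene surface. The function name and the per-pixel depth comparison are illustrative assumptions.

```python
import numpy as np

def composite_with_depth(scene_rgb, scene_depth, person_rgb, person_depth, person_mask):
    """Depth-ordered compositing (illustrative stand-in for the learned synthesis).

    The person is visible only where the mask is on AND the person's depth is
    smaller (closer to the camera) than the scene's depth at that pixel.
    """
    visible = person_mask & (person_depth < scene_depth)
    out = scene_rgb.copy()
    out[visible] = person_rgb[visible]
    return out

# Toy 4x4 scene: left half is a near foreground occluder (depth 1), right half is far (depth 10).
scene_rgb = np.zeros((4, 4, 3), dtype=np.uint8)
scene_depth = np.full((4, 4), 10.0)
scene_depth[:, :2] = 1.0  # foreground object on the left

# Hypothetical "synthesized" person covering the frame at depth 5.
person_rgb = np.full((4, 4, 3), 255, dtype=np.uint8)
person_depth = np.full((4, 4), 5.0)
person_mask = np.ones((4, 4), dtype=bool)

result = composite_with_depth(scene_rgb, scene_depth, person_rgb, person_depth, person_mask)
# The person appears only on the right half; the near object occludes the left half.
```

A naive method without depth reasoning would paste the person over the occluder as well; the depth comparison is what both proposed methods learn to respect, explicitly (two-stage, with depth supervision) or implicitly (end-to-end).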