MILD: Multi-Layer Diffusion Strategy for Complex and Precise Multi-IP Aware Human Erasing

📅 2025-08-05
📈 Citations: 0 (influential: 0)
🤖 AI Summary
To address low erasure accuracy and incomplete background restoration in complex multi-identity-person (multi-IP) scenes, such as severe human-human occlusion, human-object entanglement, and camouflaged backgrounds, this paper proposes Multi-Layer Diffusion (MILD), a strategy that decomposes generation into semantically separated pathways for each person instance and the background. It makes three contributions: (1) a high-quality multi-IP human erasing dataset with dense occlusions, diverse pose variations, and complex backgrounds; (2) Human Morphology Guidance, which integrates pose, parsing, and spatial relations to improve human-centric understanding; and (3) Spatially-Modulated Attention, which guides attention flow so that foreground instances can be disentangled from one another and from the background. Experiments on challenging human erasing benchmarks show that MILD outperforms state-of-the-art methods.

📝 Abstract
Recent years have witnessed the success of diffusion models in image-customized tasks. Prior works have achieved notable progress on human-oriented erasing using explicit mask guidance and semantic-aware inpainting. However, they struggle under complex multi-IP scenarios involving human-human occlusions, human-object entanglements, and background interferences. These challenges are mainly due to: 1) Dataset limitations, as existing datasets rarely cover dense occlusions, camouflaged backgrounds, and diverse interactions; 2) Lack of spatial decoupling, where foreground instances cannot be effectively disentangled, limiting clean background restoration. In this work, we introduce a high-quality multi-IP human erasing dataset with diverse pose variations and complex backgrounds. We then propose Multi-Layer Diffusion (MILD), a novel strategy that decomposes generation into semantically separated pathways for each instance and the background. To enhance human-centric understanding, we introduce Human Morphology Guidance, integrating pose, parsing, and spatial relations. We further present Spatially-Modulated Attention to better guide attention flow. Extensive experiments show that MILD outperforms state-of-the-art methods on challenging human erasing benchmarks.
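The layered decomposition described in the abstract can be pictured with a toy example. The sketch below is not the paper's diffusion pipeline; it only illustrates, under assumed data structures, why keeping each person and the background in separate layers makes erasing straightforward: removing a person amounts to leaving that layer out when the layers are composited. The function `composite`, the `(instance_id, pixels, mask)` layer format, and the one-dimensional "image" are all hypothetical choices made for brevity.

```python
def composite(background, layers, keep):
    """Paint the kept instance layers over the background, back to front.

    Hypothetical layer format: (instance_id, pixels, mask), where
    mask[i] == 1 means the instance covers pixel i.
    """
    canvas = list(background)  # copy so the background layer stays untouched
    for instance_id, pixels, mask in layers:
        if instance_id not in keep:
            continue  # an erased instance is simply never painted
        for i, covered in enumerate(mask):
            if covered:
                canvas[i] = pixels[i]
    return canvas


# Toy 1-D scene: two people, where "b" partly occludes "a".
background = [0, 0, 0, 0, 0]
layers = [
    ("a", [5, 5, 5, 5, 5], [1, 1, 0, 0, 0]),
    ("b", [9, 9, 9, 9, 9], [0, 1, 1, 0, 0]),
]

full_scene = composite(background, layers, keep={"a", "b"})
without_a = composite(background, layers, keep={"b"})
empty_scene = composite(background, layers, keep=set())
```

In this toy setting the background behind each person is already known. In the real task it is not, which is why a generative model is needed to synthesize the background and the occluded parts of each instance.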

Problem

Research questions and friction points this paper is trying to address.

Handles human erasing in complex multi-IP scenarios
Addresses dataset gaps in occlusions and interactions
Improves spatial decoupling for clean background restoration

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Layer Diffusion for instance-background decoupling
Human Morphology Guidance with pose and parsing
Spatially-Modulated Attention for precise attention flow
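
The summary does not give the formulation of Spatially-Modulated Attention, so the following is only a minimal sketch of one common way to steer attention with spatial information: adding a bias to the attention logits so that tokens mostly attend within their own region. The function `modulated_attention`, the `region_ids` labels (for example, 0 for background and 1..K for person instances), and the `penalty` parameter are assumptions made for illustration, not the authors' method.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def modulated_attention(queries, keys, values, region_ids, penalty=1e9):
    """Scaled dot-product attention with an additive spatial bias.

    region_ids[i] is a hypothetical region label for token i. Logits
    between tokens from different regions are lowered by `penalty`;
    a large value blocks cross-region attention, a small one softens it.
    """
    dim = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        logits = []
        for j, k in enumerate(keys):
            score = sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            if region_ids[i] != region_ids[j]:
                score -= penalty  # spatial modulation of the logit
            logits.append(score)
        weights = softmax(logits)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


# Three tokens: two in region 0 (background), one in region 1 (a person).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0], [2.0], [10.0]]
out = modulated_attention(tokens, tokens, values, region_ids=[0, 0, 1])
```

With the default penalty, the two background tokens mix only each other's values, and the person token sees only itself. Lowering `penalty` lets some information cross region boundaries, which is the sense in which the attention is modulated rather than hard-masked.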