🤖 AI Summary
Current amodal completion methods suffer from heavy data dependency, poor generalization, and error accumulation in progressive inpainting. To address these issues, this paper proposes a multi-agent collaborative reasoning framework that jointly models occlusion relationships and boundary expansion to achieve precise mask restoration and semantically consistent image synthesis. It introduces a fine-grained semantic-guidance mechanism coupled with a Diffusion Transformer-driven joint guidance scheme over attention maps and visible masks, effectively preventing erroneous redrawing of occluders. Moreover, the method directly outputs layered RGBA representations, eliminating the need for post-hoc segmentation. Extensive experiments demonstrate state-of-the-art visual quality across multiple benchmarks, with particularly notable gains in structural coherence and semantic fidelity for heavily occluded regions.
📝 Abstract
Amodal completion, the task of generating the invisible parts of occluded objects, is vital for applications such as image editing and AR. Prior methods struggle with heavy data requirements, poor generalization, or error accumulation in progressive pipelines. We propose a Collaborative Multi-Agent Reasoning Framework that performs collaborative reasoning upfront to overcome these issues. Multiple agents jointly analyze occlusion relationships and determine the necessary boundary expansion, yielding a precise mask for inpainting. Concurrently, an agent generates fine-grained textual descriptions, enabling Fine-Grained Semantic Guidance that ensures accurate object synthesis and prevents the regeneration of occluders or other unwanted elements, especially within large inpainting areas. Furthermore, our method directly produces layered RGBA outputs guided by visible masks and attention maps from a Diffusion Transformer, eliminating the need for a separate segmentation step. Extensive evaluations demonstrate that our framework achieves state-of-the-art visual quality.
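The agent pipeline described above can be sketched at a high level. This is a minimal, hypothetical illustration of the control flow only: all names (`Scene`, `analyze_occlusion`, `expand_boundary`, `describe_object`, `amodal_complete`) are our own placeholders, the masks are toy binary grids rather than real images, and the Diffusion Transformer stage is elided.

```python
from dataclasses import dataclass

@dataclass
class Scene:
    # Toy 2D binary grids standing in for real segmentation masks.
    visible_mask: list   # pixels of the target object that are visible
    occluder_mask: list  # pixels covered by the occluding object

def analyze_occlusion(scene):
    """Agent 1 (hypothetical): mark pixels hidden by the occluder."""
    return [[int(o and not v) for v, o in zip(v_row, o_row)]
            for v_row, o_row in zip(scene.visible_mask, scene.occluder_mask)]

def expand_boundary(mask, margin=1):
    """Agent 2 (hypothetical): dilate the occluded region so the
    inpainting mask safely covers object boundaries."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if any(mask[ny][nx]
                   for ny in range(max(0, y - margin), min(h, y + margin + 1))
                   for nx in range(max(0, x - margin), min(w, x + margin + 1))):
                out[y][x] = 1
    return out

def describe_object(scene):
    """Agent 3 (hypothetical): emit a fine-grained textual prompt that
    steers synthesis away from regenerating the occluder."""
    return "the complete target object, no occluder, consistent texture"

def amodal_complete(scene):
    """Upfront collaborative reasoning: agents produce the precise
    inpainting mask and prompt before any generation happens."""
    occluded = analyze_occlusion(scene)
    inpaint_mask = expand_boundary(occluded)
    prompt = describe_object(scene)
    # A Diffusion Transformer would consume (scene, inpaint_mask, prompt),
    # with attention maps and the visible mask guiding a layered RGBA
    # output; here we just return the inputs it would receive.
    return inpaint_mask, prompt
```

The point of the sketch is the ordering: occlusion analysis and boundary expansion happen once, before inpainting, rather than being interleaved with generation as in progressive pipelines where errors can accumulate.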