🤖 AI Summary
This paper addresses the problem of decomposing a single image into foreground and background layers. The proposed method is a lightweight diffusion-based layer disentanglement framework. Methodologically, it (1) adapts a pre-trained diffusion inpainting model to layer separation via lightweight fine-tuning; (2) introduces a multimodal context fusion module that models cross-layer semantic dependencies in latent space using linear-complexity attention, preserving fine-grained detail; and (3) constructs a high-quality synthetic dataset that enables end-to-end training. Experiments demonstrate substantial improvements over state-of-the-art approaches on object removal and occlusion recovery. The method yields high-fidelity, editable layered representations, facilitating downstream applications such as image editing and content creation.
📝 Abstract
Images can be viewed as layered compositions in which foreground objects sit over a background, possibly with occlusions. This layered representation enables independent editing of individual elements, offering greater flexibility for content creation. Despite progress in large generative models, decomposing a single image into layers remains challenging due to limited methods and data. We observe a strong connection between layer decomposition and in/outpainting tasks, and propose adapting a diffusion-based inpainting model for layer decomposition via lightweight fine-tuning. To further preserve detail in the latent space, we introduce a novel multimodal context fusion module with linear attention complexity. Our model is trained purely on a synthetic dataset constructed from open-source assets and achieves superior performance in object removal and occlusion recovery, unlocking new possibilities in downstream editing and creative applications.
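Both the summary and abstract mention a context fusion module with linear attention complexity, but the paper's exact formulation is not reproduced here. A common way to achieve linear complexity in sequence length is kernelized attention with a positive feature map (e.g., elu(x)+1), which replaces the n×n attention matrix with two O(n·d) matrix products. The sketch below is an illustrative assumption of that general construction, not the paper's actual module; the function name and feature map are hypothetical.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Kernelized attention with a positive feature map.

    Complexity is O(n * d * d_v), linear in sequence length n,
    because K^T V is computed once instead of the n x n matrix Q K^T.
    This is a generic sketch, not the paper's specific fusion module.
    """
    def phi(x):
        # elu(x) + 1: positive feature map so attention weights are positive
        return np.where(x > 0, x + 1.0, np.exp(x))

    Qp, Kp = phi(Q), phi(K)            # (n, d)
    KV = Kp.T @ V                      # (d, d_v): aggregated key-value summary
    Z = Qp @ Kp.sum(axis=0)            # (n,): per-query normalizer
    return (Qp @ KV) / (Z[:, None] + eps)

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = rng.standard_normal((3, n, d))
out = linear_attention(Q, K, V)
print(out.shape)
```

Because the feature map is positive, each output row is a convex combination of the rows of V, mirroring softmax attention's averaging behavior while avoiding the quadratic cost.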