🤖 AI Summary
This work addresses the challenge of disentangling visual layers, such as a logo and the object surface it sits on, in real-world images, where non-linear, globally coupled effects such as shading, reflection, and perspective distortion hinder clean separation. To tackle this, the authors propose an unsupervised layered decomposition framework built on a pretrained diffusion model. The approach uses lightweight LoRA fine-tuning to jointly train decomposition and composition models, enforces cycle consistency between decomposed and recomposed images to make the separation more robust, and adds a progressive self-improving strategy that iteratively augments the training set with high-quality model-generated examples. Requiring no additional annotations, the method produces accurate, structurally coherent results on logo-object decomposition and generalizes to other decomposition types, suggesting its potential as a unified framework for layered image decomposition.
📝 Abstract
Disentangling visual layers in real-world images is a persistent challenge in vision and graphics, as such layers often involve non-linear and globally coupled interactions, including shading, reflection, and perspective distortion. In this work, we present an in-context image decomposition framework that leverages large diffusion foundation models for layered separation. We focus on the challenging case of logo-object decomposition, where the goal is to disentangle a logo from the surface on which it appears while faithfully preserving both layers. Our method fine-tunes a pretrained diffusion model via lightweight LoRA adaptation and introduces a cycle-consistent tuning strategy that jointly trains decomposition and composition models, enforcing reconstruction consistency between decomposed and recomposed images. This bidirectional supervision substantially enhances robustness in cases where the layers exhibit complex interactions. Furthermore, we introduce a progressive self-improving process, which iteratively augments the training set with high-quality model-generated examples to refine performance. Extensive experiments demonstrate that our approach achieves accurate and coherent decompositions and also generalizes effectively across other decomposition types, suggesting its potential as a unified framework for layered image decomposition.
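To make the cycle-consistency and self-improving ideas concrete, the following is a minimal toy sketch, not the authors' implementation. It treats an image as a flat list of floats and assumes two hypothetical callables, `decompose` (image to logo and surface) and `compose` (logo and surface to image), standing in for the LoRA-tuned diffusion models. The abstract does not say how "high-quality" generated examples are selected; using the cycle reconstruction error as the filter, and the names `cycle_loss`, `self_improve_round`, and `max_cycle_error`, are assumptions made only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[float]                      # toy stand-in for an image
Triplet = Tuple[Image, Image, Image]     # (composite, logo, surface)
Decompose = Callable[[Image], Tuple[Image, Image]]
Compose = Callable[[Image, Image], Image]


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equally sized toy images."""
    if len(a) != len(b):
        raise ValueError("images must have the same size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_loss(image: Image, decompose: Decompose, compose: Compose) -> float:
    """Reconstruction error of image -> (logo, surface) -> recomposed image."""
    logo, surface = decompose(image)
    return l1_distance(compose(logo, surface), image)


def self_improve_round(
    train_set: Sequence[Triplet],
    candidates: Sequence[Image],
    decompose: Decompose,
    compose: Compose,
    max_cycle_error: float,
) -> List[Triplet]:
    """Extend the training set with model outputs that pass the cycle check.

    Assumption: a low cycle error is used as the quality criterion.
    """
    accepted: List[Triplet] = []
    for image in candidates:
        logo, surface = decompose(image)
        if l1_distance(compose(logo, surface), image) <= max_cycle_error:
            accepted.append((image, logo, surface))
    return list(train_set) + accepted


# Hypothetical toy models: additive composition and a lossy decomposition
# that cannot represent very bright pixels.
def toy_compose(logo: Image, surface: Image) -> Image:
    return [l + s for l, s in zip(logo, surface)]


def toy_decompose(image: Image) -> Tuple[Image, Image]:
    logo = [min(p, 0.5) for p in image]
    surface = [min(p - l, 0.25) for p, l in zip(image, logo)]
    return logo, surface


seed = [([0.5, 0.5], [0.5, 0.5], [0.0, 0.0])]
candidates = [[0.25, 0.75], [1.0, 1.0]]
grown = self_improve_round(seed, candidates, toy_decompose, toy_compose,
                           max_cycle_error=0.05)
print(len(grown))  # 2: only the first candidate survives the cycle check
```

In this toy setting the second candidate is rejected because its recomposed image differs from the input, which mirrors the intuition that only decompositions consistent with their recomposition should be fed back into training. In a real system the same cycle error would also serve as a training loss for both models rather than only as a filter.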