🤖 AI Summary
Single-image reflection removal is highly ill-posed, and existing methods struggle to reason about how a composite image is formed, which limits both recovery quality and generalization. We adapt an editing-oriented latent diffusion model to this task and identify a key obstacle: the latent spaces of standard semantic encoders do not support a physically consistent linear superposition of the reflection and transmission layers. To address this, we propose a reflection-equivariant VAE that builds a structured latent space aligned with the linear physics of reflection formation. We further introduce a learnable task-specific text embedding, which provides precise guidance without relying on ambiguous language, and a depth-guided early-branching sampling strategy that exploits the stochasticity of the generative process. Our method is the first to explicitly incorporate physical priors of reflection formation into the latent-space design of diffusion models. Extensive experiments demonstrate state-of-the-art performance on benchmarks including SOTS and REI, together with strong robustness and generalization under complex real-world conditions.
📝 Abstract
Single-image reflection removal is a highly ill-posed problem: existing methods struggle to reason about the composition of corrupted regions, so they fail to recover the underlying scene and generalize poorly in the wild. This work repurposes an editing-oriented latent diffusion model to perceive and process highly ambiguous, layered image inputs and to produce high-quality outputs. We argue that the difficulty of this conversion stems from a critical yet overlooked issue: the latent space of semantic encoders lacks the inherent structure needed to interpret a composite image as a linear superposition of its constituent layers. Our approach is built on three synergistic components: a reflection-equivariant VAE that aligns the latent space with the linear physics of reflection formation; a learnable task-specific text embedding that provides precise guidance while bypassing ambiguous language; and a depth-guided early-branching sampling strategy that harnesses generative stochasticity to obtain promising results. Extensive experiments show that our model achieves new state-of-the-art performance on multiple benchmarks and generalizes well to challenging real-world cases.
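The latent-space argument above can be made concrete with a small numerical sketch. It is not the authors' VAE or training objective. It only illustrates the property the abstract describes, under two assumptions that are not stated in the source: a simple additive formation model (composite = transmission + reflection) and toy encoders acting on flattened 64-pixel vectors. All names (`superposition_gap`, `linear_encoder`, `nonlinear_encoder`) are hypothetical.

```python
import numpy as np

def superposition_gap(encode, transmission, reflection):
    """Relative error between encoding the composite and summing the layer encodings."""
    composite = transmission + reflection            # assumed additive formation model
    z_composite = encode(composite)
    z_sum = encode(transmission) + encode(reflection)
    return np.linalg.norm(z_composite - z_sum) / (np.linalg.norm(z_sum) + 1e-12)

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64))                        # toy map: 64-pixel image -> 16-dim latent

def linear_encoder(x):
    # A purely linear encoder preserves superposition up to floating-point error.
    return W @ x

def nonlinear_encoder(x):
    # A generic nonlinear encoder has no reason to preserve superposition.
    return np.tanh(W @ x)

T = rng.uniform(0.0, 1.0, size=64)                   # toy transmission layer
R = 0.3 * rng.uniform(0.0, 1.0, size=64)             # toy (weaker) reflection layer

gap_linear = superposition_gap(linear_encoder, T, R)
gap_nonlinear = superposition_gap(nonlinear_encoder, T, R)
print(f"linear encoder gap:    {gap_linear:.2e}")
print(f"nonlinear encoder gap: {gap_nonlinear:.2e}")
```

In this toy setting the linear encoder's gap is at the level of numerical noise, while the nonlinear encoder's gap is clearly nonzero. A gap of this kind is one plausible way to quantify how far a latent space is from supporting linear superposition; whether the paper measures or penalizes it in this form is not stated in the abstract.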