Referring Layer Decomposition

📅 2026-02-22
🤖 AI Summary
Most existing image editing methods operate on whole images and struggle to isolate and manipulate specific objects within a scene. This work introduces a new task, Referring Layer Decomposition (RLD), which predicts complete RGBA layers from a single RGB image conditioned on user-provided referring cues, such as points, bounding boxes, masks, or natural language, to enable object-level controllable decomposition and editing. To support the task, the authors construct RefLade, a large-scale dataset of 1.11 million image-layer-prompt triplets, establish an automatic evaluation protocol aligned with human preferences, and propose RefLayer, a simple baseline model. Experiments indicate that RefLayer achieves high visual fidelity and semantic alignment, and that the dataset and protocol together support effective training, reliable evaluation, and strong zero-shot generalization.

📝 Abstract
Precise, object-aware control over visual content is essential for advanced image editing and compositional generation. Yet most existing approaches operate on entire images holistically, limiting the ability to isolate and manipulate individual scene elements. In contrast, layered representations, where scenes are explicitly separated into objects, environmental context, and visual effects, provide a more intuitive and structured framework for interpreting and editing visual content. To bridge this gap and enable both compositional understanding and controllable editing, we introduce the Referring Layer Decomposition (RLD) task, which predicts complete RGBA layers from a single RGB image, conditioned on flexible user prompts such as spatial inputs (e.g., points, boxes, masks), natural language descriptions, or combinations thereof. At the core of this work is RefLade, a large-scale dataset comprising 1.11M image-layer-prompt triplets produced by our scalable data engine, along with 100K manually curated, high-fidelity layers. Coupled with a perceptually grounded, human-preference-aligned automatic evaluation protocol, RefLade establishes RLD as a well-defined and benchmarkable research task. Building on this foundation, we present RefLayer, a simple baseline designed for prompt-conditioned layer decomposition that achieves high visual fidelity and semantic alignment. Extensive experiments show that our approach enables effective training, reliable evaluation, and high-quality image decomposition, while exhibiting strong zero-shot generalization capabilities.
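
To make the idea of an RGBA layer concrete, the sketch below shows how a referred layer could be placed back onto a background with the standard alpha "over" operator. It is an illustration only and not the paper's data format or model: the `ReferringPrompt` container, its field names, and the `composite_over` helper are hypothetical, and pixels are assumed to be floats in [0, 1] with straight (non-premultiplied) alpha.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Pixel = Tuple[float, float, float]           # RGB, each channel in [0, 1]
PixelA = Tuple[float, float, float, float]   # RGBA, straight (non-premultiplied) alpha


@dataclass
class ReferringPrompt:
    """Hypothetical container for the referring cues listed in the abstract."""
    points: Optional[List[Tuple[int, int]]] = None    # (x, y) clicks
    box: Optional[Tuple[int, int, int, int]] = None   # (x0, y0, x1, y1)
    text: Optional[str] = None                        # natural-language description


def composite_over(layer: List[List[PixelA]],
                   background: List[List[Pixel]]) -> List[List[Pixel]]:
    """Blend an RGBA layer onto an RGB background with the 'over' operator."""
    out = []
    for layer_row, bg_row in zip(layer, background):
        row = []
        for (r, g, b, a), (bg_r, bg_g, bg_b) in zip(layer_row, bg_row):
            # out = alpha * foreground + (1 - alpha) * background, per channel
            row.append((a * r + (1 - a) * bg_r,
                        a * g + (1 - a) * bg_g,
                        a * b + (1 - a) * bg_b))
        out.append(row)
    return out


# A 1x2 example: an opaque red pixel and a fully transparent pixel over blue.
prompt = ReferringPrompt(text="the red object")
layer = [[(1.0, 0.0, 0.0, 1.0), (1.0, 0.0, 0.0, 0.0)]]
background = [[(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]]
recomposed = composite_over(layer, background)
```

Because the layer carries its own alpha channel, an edit such as moving, recoloring, or removing the object only touches the layer, and the scene is rebuilt by compositing again.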

Problem

Research questions and friction points this paper is trying to address.

layer decomposition
object-aware editing
prompt-conditioned generation
compositional image understanding
RGBA layer prediction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Referring Layer Decomposition
layered representation
prompt-conditioned generation
image decomposition
zero-shot generalization