🤖 AI Summary
This work addresses two challenges in image layer decomposition: the absence of paired supervision, and the unreliability of vision-language model (VLM) scores as a reward for policy optimization. The authors propose a reinforcement learning approach built on the pretrained Qwen-Image-Layered model, fine-tuned efficiently with Flow-GRPO and LoRA. They introduce a two-stage VLM evaluation mechanism: each sample is first scored against five edit-oriented criteria, and all candidates are then placed side by side in a grid and re-scored for calibration. This makes the scores more discriminative and gives the policy more within-group variance to learn from. Evaluated on the Crello dataset, the method produces cleaner layer separation, fewer blank or artifact-heavy layers, and lower per-layer reconstruction error than the base model.
📝 Abstract
We present Stable-Layers, a reinforcement learning framework that eliminates the need for paired supervision by fine-tuning a pretrained layer decomposition model using only feedback from a vision-language model (VLM). Starting from Qwen-Image-Layered, we apply Flow-GRPO with LoRA adaptation, sampling multiple candidate decompositions per image, scoring them with a VLM, and optimising the policy from group-relative advantages. The key challenge lies in designing a reliable reward signal: VLMs scoring samples in isolation tend to compress their judgements into a narrow band, leaving GRPO with little within-group variance to learn from. We address this with a two-stage evaluation pipeline that pairs structured per-sample scoring across five edit-centric criteria with a grid-based calibration step in which the VLM re-scores all candidates side-by-side. Stable-Layers produces decompositions with stronger layer separation, fewer blank or artifact-heavy layers, and lower per-layer reconstruction error on the Crello dataset compared to the base model.
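To make the reward-compression problem concrete, here is a minimal sketch of group-relative advantage computation. It assumes the commonly used GRPO normalisation (reward minus the group mean, divided by the group standard deviation); the abstract does not spell out the exact formula. The function name, the `eps` constant, and the score values are hypothetical illustrations, not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise rewards within one group of candidates for the same image.

    Assumption: standard GRPO-style normalisation, (r - mean) / (std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical VLM scores for four candidate decompositions of one image.
# Scored in isolation, every candidate lands on the same value.
isolated_scores = [7.0, 7.0, 7.0, 7.0]

# Hypothetical scores after the candidates are re-scored side by side.
calibrated_scores = [4.0, 6.0, 8.0, 6.0]

isolated_adv = group_relative_advantages(isolated_scores)
calibrated_adv = group_relative_advantages(calibrated_scores)

print("isolated:  ", isolated_adv)    # every advantage is zero
print("calibrated:", calibrated_adv)  # candidates are now ranked
```

When every candidate in a group receives the same score, all advantages are zero and the group contributes no gradient signal. Once the scores are spread out, the best candidate gets a positive advantage and the worst a negative one, which is the within-group variance that the calibration step is meant to restore.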