Stable-Layers: Fine-Tuning Image Layer Decomposition Models with VLM-Scored Reinforcement Learning

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two obstacles in image layer decomposition: the absence of paired supervision, and the unreliability of vision-language model (VLM) scores as a reward for policy optimization. The authors fine-tune the pretrained Qwen-Image-Layered model with reinforcement learning, using Flow-GRPO with LoRA-based efficient fine-tuning. Rewards come from a two-stage VLM evaluation: each sample is first scored on five edit-oriented criteria, and the candidates are then placed side by side in a grid and re-scored for calibration, which makes the scores more discriminative and training more stable. On the Crello dataset, the method yields cleaner layer separation, fewer blank or artifact-heavy layers, and lower per-layer reconstruction error than the base model.

📝 Abstract

We present Stable-Layers, a reinforcement learning framework that eliminates the need for paired supervision by fine-tuning a pretrained layer decomposition model using only feedback from a vision-language model (VLM). Starting from Qwen-Image-Layered, we apply Flow-GRPO with LoRA adaptation, sampling multiple candidate decompositions per image, scoring them with a VLM, and optimising the policy from group-relative advantages. The key challenge lies in designing a reliable reward signal: VLMs scoring samples in isolation tend to compress their judgements into a narrow band, leaving GRPO with little within-group variance to learn from. We address this with a two-stage evaluation pipeline that pairs structured per-sample scoring across five edit-centric criteria with a grid-based calibration step in which the VLM re-scores all candidates side-by-side. Stable-Layers produces decompositions with stronger layer separation, fewer blank or artifact-heavy layers, and lower per-layer reconstruction error on the Crello dataset compared to the base model.
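To illustrate why compressed scores starve GRPO of signal, here is a minimal sketch of the usual group-relative advantage, (reward minus group mean) divided by group standard deviation. It assumes the standard GRPO normalisation; the exact variant used in the paper is not stated here. The function name, the `eps` constant, and all score values are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalise rewards within one sampling group: (r - mean) / (std + eps).

    Assumption: standard GRPO-style normalisation, not necessarily the
    paper's exact formulation. `eps` guards against a zero standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Made-up VLM scores for four candidate decompositions of one image.
isolated = [7.0, 7.0, 7.0, 7.0]      # extreme case of scoring in isolation: no spread
side_by_side = [4.0, 6.0, 8.0, 9.0]  # re-scored together in a grid: candidates separate

flat = group_relative_advantages(isolated)
spread = group_relative_advantages(side_by_side)

print(flat)    # every advantage is zero, so the policy gets no learning signal
print(spread)  # negative for weak candidates, positive for strong ones
```

When every candidate in a group receives the same score, all advantages are zero and the update carries no information. When scores differ only slightly, the division by a tiny standard deviation amplifies scoring noise instead. Side-by-side re-scoring is the paper's way of restoring a meaningful spread within the group.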

Problem

Research questions and friction points this paper is trying to address.

image layer decomposition

reinforcement learning

vision-language model

unsupervised fine-tuning

reward signal design

Innovation

Methods, ideas, or system contributions that make the work stand out.

reinforcement learning

layer decomposition

vision-language model