UniRef-Image-Edit: Towards Scalable and Consistent Multi-Reference Image Editing

📅 2026-02-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of maintaining cross-reference consistency in multi-reference image editing with diffusion models, which often suffer from insufficient interaction among reference images. To this end, the authors propose a unified generative framework that integrates single-image editing and multi-image composition. Central to this approach is the novel Sequence-Extended Latent Fusion (SELF) representation, which serializes multiple reference images into a single latent sequence. The framework is trained with supervised fine-tuning (SFT) under a progressive sequence-length schedule to enhance output fidelity, followed by a GRPO-based reinforcement learning stage tailored to multi-source references that significantly improves visual consistency and detail preservation in edited results. The code, models, and data will be publicly released.
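The SELF representation described above serializes several reference images into one latent sequence. Since the code is not yet released, the following is only a minimal sketch of what such a serialization could look like: per-reference latent grids are flattened into tokens and concatenated, with a reference-index tag per token so downstream attention can tell sources apart. The shapes and the index-tagging scheme are assumptions, not the paper's implementation.

```python
import numpy as np

def serialize_references(ref_latents):
    """Sketch of a SELF-style serialization (hypothetical).

    Each reference latent is an (h, w, c) grid. We flatten each grid
    into (h*w, c) tokens, concatenate all references into one sequence,
    and record which reference each token came from.
    """
    tokens, ref_ids = [], []
    for idx, lat in enumerate(ref_latents):
        flat = lat.reshape(-1, lat.shape[-1])          # (h*w, c) tokens
        tokens.append(flat)
        ref_ids.append(np.full(flat.shape[0], idx))    # source index per token
    return np.concatenate(tokens, axis=0), np.concatenate(ref_ids)

# Two toy reference latents with different spatial sizes
a = np.zeros((4, 4, 8))
b = np.zeros((2, 3, 8))
seq, ids = serialize_references([a, b])   # seq: (16 + 6, 8) token sequence
```

Note that references of unequal size pose no problem here: the sequence simply grows with the total number of latent tokens, which is exactly what the pixel-budget constraint in the abstract bounds.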

📝 Abstract
We present UniRef-Image-Edit, a high-performance multi-modal generation system that unifies single-image editing and multi-image composition within a single framework. Existing diffusion-based editing methods often struggle to maintain consistency across multiple conditions due to limited interaction between reference inputs. To address this, we introduce Sequence-Extended Latent Fusion (SELF), a unified input representation that dynamically serializes multiple reference images into a coherent latent sequence. During a dedicated training stage, all reference images are jointly constrained to fit within a fixed-length sequence under a global pixel-budget constraint. Building upon SELF, we propose a two-stage training framework comprising supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we jointly train on single-image editing and multi-image composition tasks to establish a robust generative prior. We adopt a progressive sequence length training strategy, in which all input images are initially resized to a total pixel budget of $1024^2$, and are then gradually increased to $1536^2$ and $2048^2$ to improve visual fidelity and cross-reference consistency. This gradual relaxation of compression enables the model to incrementally capture finer visual details while maintaining stable alignment across references. For the RL stage, we introduce Multi-Source GRPO (MSGRPO), to our knowledge the first reinforcement learning framework tailored for multi-reference image generation. MSGRPO optimizes the model to reconcile conflicting visual constraints, significantly enhancing compositional consistency. We will open-source the code, models, training data, and reward data for community research purposes.
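The abstract's progressive sequence-length strategy constrains all reference images to a shared total pixel budget ($1024^2$, then $1536^2$, then $2048^2$). A minimal sketch of one plausible way to enforce such a global budget, assuming a single shared downscale factor across all references (a policy chosen here for illustration, not stated in the paper):

```python
import math

def fit_pixel_budget(sizes, budget=1024**2):
    """Scale all reference images by one shared factor so that their
    combined pixel count fits the stage budget (hypothetical sketch).

    sizes: list of (width, height) tuples for the reference images.
    Returns the resized (width, height) tuples.
    """
    total = sum(w * h for w, h in sizes)
    scale = min(1.0, math.sqrt(budget / total))  # never upscale
    return [(max(1, round(w * scale)), max(1, round(h * scale)))
            for w, h in sizes]

# Two large references squeezed into the first-stage 1024^2 budget
sizes = [(2048, 1024), (1536, 1536)]
resized = fit_pixel_budget(sizes, budget=1024**2)
```

Raising the budget across stages (to $1536^2$ and $2048^2$) relaxes the compression, which matches the abstract's claim that the model can incrementally capture finer detail while keeping cross-reference alignment stable.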
Problem

Research questions and friction points this paper is trying to address.

multi-reference image editing
consistency
diffusion models
visual alignment
image composition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sequence-Extended Latent Fusion
Multi-Reference Image Editing
Progressive Sequence Length Training
Multi-Source GRPO
Two-Stage Training Framework
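The Multi-Source GRPO contribution builds on group-relative policy optimization, whose core idea is to score a group of sampled outputs and normalize each reward against the group statistics. A minimal sketch of that group-relative advantage computation; the reward function itself (e.g., a cross-reference consistency score) is not specified here and would come from the paper's reward data:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as used in GRPO-style RL (sketch).

    Each sampled edit in a group receives a scalar reward; its advantage
    is the reward standardized by the group's mean and std, so samples
    are reinforced only relative to their peers.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Four sampled edits scored by a (hypothetical) consistency reward
adv = grpo_advantages([0.2, 0.5, 0.8, 0.5])
```

Because advantages are zero-mean within each group, only above-average samples are pushed up, which is what lets such a scheme arbitrate between conflicting visual constraints across references.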
👥 Authors
Hongyang Wei (Tsinghua University)
Bin Wen (Kuaishou)
Yancheng Long (Harbin Institute of Technology, Shenzhen)
Yankai Yang (Harbin Institute of Technology, Shenzhen)
Yuhang Hu (Kuaishou Technology)
Tianke Zhang (Tsinghua University; Kuaishou Technology)
Wei Chen (Kuaishou Technology)
Haonan Fan (Kuaishou Technology)
Kaiyu Jiang (Kuaishou)
Jiankang Chen (Kuaishou Technology)
Changyi Liu (Kuaishou Technology)
Kaiyu Tang (Kuaishou Technology)
Haojie Ding (Kuaishou Technology)
Xiao Yang (Kuaishou Technology)
Jia Sun (Hong Kong University of Science and Technology (Guangzhou))
Huaiqing Wang (Kuaishou Technology)
Zhenyu Yang (Kuaishou Technology)
Xinyu Wei (PolyU & PKU)
Xianglong He (Tsinghua University)
Yangguang Li (CUHK)
Fan Yang (Kuaishou Technology)
Tingting Gao (Kuaishou Technology)
Lei Zhang (Chair Professor, Dept. of Computing, The Hong Kong Polytechnic University)
Guorui Zhou (unknown affiliation)
Han Li (Kuaishou Technology)