Skywork UniPic 2.0: Building Kontext Model with Online RL for Unified Multimodal Model

📅 2025-09-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing open-source multimodal models predominantly rely on parameter scaling while neglecting the optimization of training strategies, resulting in suboptimal efficiency and performance. To address this, the authors propose UniPic2, a unified framework with two key components: (1) UniPic2-SD3.5M-Kontext, a 2B-parameter DiT model based on SD3.5-Medium that combines large-scale pretraining on high-quality data with a novel Progressive Dual-Task Reinforcement (PDTR) online reinforcement learning strategy, enabling staged optimization of generation and editing without negative transfer; and (2) UniPic2-MetaQuery, a unified model that follows the MetaQuery bridging approach to connect Qwen2.5-VL-7B (perception and reasoning) with the DiT backbone (generation and editing) via a lightweight connector. Experiments show that UniPic2-SD3.5M-Kontext outperforms BAGEL (7B) and Flux-Kontext (12B) on image generation and editing, while UniPic2-MetaQuery achieves state-of-the-art performance across multimodal understanding, generation, and editing benchmarks, validating the effectiveness and generalizability of this lightweight, efficient, and scalable training paradigm.
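The summary describes PDTR only at a high level, and this page includes no pseudocode. Below is a minimal, hypothetical sketch of what a staged dual-task online RL loop could look like, assuming a REINFORCE-style update with group-normalized rewards and separate reward functions for generation and editing; every name here (`model.sample`, `gen_reward`, `edit_reward`) is illustrative, not from the paper.

```python
import torch

def pdtr_train(model, optimizer, gen_loader, edit_loader,
               gen_reward, edit_reward, steps_per_stage=1000):
    """Hypothetical sketch of Progressive Dual-Task Reinforcement:
    stage 1 reinforces text-to-image generation, stage 2 then
    reinforces editing, relying on the paper's finding that the
    two phases do not negatively interfere."""
    stages = [
        (gen_loader, gen_reward),    # stage 1: generation
        (edit_loader, edit_reward),  # stage 2: editing
    ]
    for loader, reward_fn in stages:
        for _, batch in zip(range(steps_per_stage), loader):
            # Sample several candidates per prompt so rewards can be
            # normalized within the group (a GRPO-style choice; the
            # paper's actual estimator is an assumption here).
            samples, logprobs = model.sample(batch["prompt"], n=4)
            with torch.no_grad():
                rewards = reward_fn(samples, batch)
            advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
            loss = -(advantages * logprobs).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

The "progressive" part is simply the stage ordering; the paper's empirical claim is that reinforcing editing after generation strengthens both tasks rather than trading one off against the other.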

📝 Abstract
Recent advances in multimodal models have demonstrated impressive capabilities in unified image generation and editing. However, many prominent open-source models prioritize scaling model parameters over optimizing training strategies, limiting their efficiency and performance. In this work, we present UniPic2-SD3.5M-Kontext, a 2B-parameter DiT model based on SD3.5-Medium, which achieves state-of-the-art image generation and editing while extending seamlessly into a unified multimodal framework. Our approach begins with architectural modifications to SD3.5-Medium and large-scale pre-training on high-quality data, enabling joint text-to-image generation and editing capabilities. To enhance instruction following and editing consistency, we propose a novel Progressive Dual-Task Reinforcement (PDTR) strategy, which effectively strengthens both tasks in a staged manner. We empirically validate that the reinforcement phases for the two tasks are mutually beneficial and do not induce negative interference. After pre-training and reinforcement, UniPic2-SD3.5M-Kontext demonstrates stronger image generation and editing capabilities than models with significantly more generation parameters, including BAGEL (7B) and Flux-Kontext (12B). Furthermore, following MetaQuery, we connect UniPic2-SD3.5M-Kontext and Qwen2.5-VL-7B via a connector and perform joint training to obtain the unified multimodal model UniPic2-MetaQuery. UniPic2-MetaQuery integrates understanding, generation, and editing, achieving top-tier performance across diverse tasks with a simple and scalable training paradigm. These results consistently validate the effectiveness and generalizability of our proposed training paradigm, which we formalize as Skywork UniPic 2.0.
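To make the MetaQuery-style bridging concrete: in MetaQuery, a set of learnable query tokens is fed through the VLM together with the prompt, and the hidden states at the query positions are projected by a lightweight connector into the generator's conditioning space. The sketch below follows that recipe; module names and dimensions are illustrative placeholders, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MetaQueryConnector(nn.Module):
    """Sketch of a MetaQuery-style bridge from a VLM (e.g.,
    Qwen2.5-VL-7B) to a DiT generator. Dimensions are placeholders."""

    def __init__(self, num_queries=64, vlm_dim=3584, dit_cond_dim=1536):
        super().__init__()
        # Learnable queries appended to the VLM input sequence.
        self.queries = nn.Parameter(torch.randn(num_queries, vlm_dim) * 0.02)
        # Lightweight projector into the DiT's conditioning space.
        self.proj = nn.Sequential(
            nn.Linear(vlm_dim, dit_cond_dim),
            nn.GELU(),
            nn.Linear(dit_cond_dim, dit_cond_dim),
        )

    def forward(self, vlm, prompt_embeds):
        b = prompt_embeds.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Run the VLM over [prompt tokens ; query tokens] and keep the
        # final hidden states at the query positions.
        out = vlm(inputs_embeds=torch.cat([prompt_embeds, q], dim=1),
                  output_hidden_states=True)
        query_states = out.hidden_states[-1][:, -q.size(1):]
        return self.proj(query_states)  # conditioning fed to the DiT
```

Joint training can then update the queries and the connector (and optionally the DiT) while leaving the VLM's understanding pathway intact; which parts are frozen in UniPic2-MetaQuery is not specified in this excerpt.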
Problem

Research questions and friction points this paper is trying to address.

Optimizing training strategies for multimodal models
Enhancing instruction following and editing consistency
Achieving unified multimodal understanding and generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

2B-parameter DiT model based on SD3.5-Medium handling both generation and editing (see the sketch after this list)
Progressive Dual-Task Reinforcement strategy (PDTR)
Unified multimodal model integrating understanding, generation, and editing
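A single DiT serving both text-to-image generation and reference-based editing, as in the first item above, is commonly built via sequence concatenation in Kontext-style models: the clean latent tokens of the reference image are appended to the noisy target tokens so attention can read from them. The sketch below assumes that design and an MMDiT-like call signature; it is an illustration, not the paper's verified architecture.

```python
import torch

def kontext_forward(dit, noisy_latents, ref_latents, text_cond, timesteps):
    """Sketch of Kontext-style editing conditioning: concatenate the
    clean reference-image tokens after the noisy target tokens, run the
    DiT once, and keep only the predictions at the target positions."""
    n_target = noisy_latents.size(1)
    # [noisy target tokens ; clean reference tokens]
    tokens = torch.cat([noisy_latents, ref_latents], dim=1)
    out = dit(tokens, text_cond=text_cond, timesteps=timesteps)
    return out[:, :n_target]  # loss applies to the target half only
```

For plain text-to-image batches, `ref_latents` can simply be an empty token sequence, so the same weights cover both tasks.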