Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation

📅 2025-08-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work introduces the first truly unified autoregressive multimodal model capable of simultaneously supporting image understanding, text-to-image generation, and image editing—without task-specific adapters or inter-module connectors. To address the challenge of unifying disparate multimodal tasks within a single architecture, the authors propose: (1) a decoupled multi-granularity encoding strategy integrating a masked autoregressive encoder with the SigLIP2 vision encoder; (2) progressive-resolution training coupled with dynamic parameter unfreezing; and (3) a million-scale reward-augmented multitask dataset. The model achieves high-resolution (1024×1024) generation on consumer-grade GPUs (<15 GB VRAM), attaining a GenEval score of 0.86 and a DPG-Bench score of 85.5 for complex generation, significantly outperforming existing unified architectures. It establishes a new Pareto-optimal trade-off between performance and deployment efficiency.

📝 Abstract
We introduce Skywork UniPic, a 1.5-billion-parameter autoregressive model that unifies image understanding, text-to-image generation, and image editing within a single architecture, eliminating the need for task-specific adapters or inter-module connectors, and demonstrate that compact multimodal systems can achieve state-of-the-art performance on commodity hardware. Skywork UniPic achieves a GenEval score of 0.86, surpassing most existing unified models; sets a new DPG-Bench complex-generation record of 85.5; attains 5.83 on GEditBench-EN and 3.49 on ImgEdit-Bench for image editing; and generates 1024 × 1024 images with under 15 GB of GPU memory (e.g., on an RTX 4090). These results rest on three key contributions: (1) a decoupled encoding strategy that leverages a masked autoregressive encoder for synthesis and a SigLIP2 encoder for understanding, both feeding a shared autoregressive decoder; (2) a progressive, resolution-aware training schedule that scales from 256 × 256 to 1024 × 1024 while dynamically unfreezing parameters to balance capacity and stability; and (3) meticulously curated, 100-million-scale datasets augmented with task-specific reward models to refine generation and editing objectives. By demonstrating that high-fidelity multimodal integration need not incur prohibitive resource demands, Skywork UniPic establishes a practical paradigm for deployable, high-fidelity multimodal AI. Code and weights are publicly available at https://huggingface.co/Skywork/Skywork-UniPic-1.5B.
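The decoupled encoding described in the abstract can be summarized as task-based routing: synthesis tasks go through the masked autoregressive (MAR) encoder, understanding goes through SigLIP2, and a single decoder consumes both. The sketch below illustrates this routing only; all function names and string representations are placeholders, not the released implementation.

```python
# Minimal sketch of the decoupled encoding idea. All names here are
# illustrative placeholders (assumptions), not the actual Skywork UniPic code.

def mar_encode(image):
    """Stand-in for the masked autoregressive (MAR) encoder used for synthesis."""
    return {"tokens": f"mar({image})", "granularity": "fine"}

def siglip2_encode(image):
    """Stand-in for the SigLIP2 vision encoder used for understanding."""
    return {"tokens": f"siglip2({image})", "granularity": "semantic"}

def shared_decoder(encoded, task):
    """Single autoregressive decoder shared by every task."""
    return f"decode[{task}]<-{encoded['tokens']}"

def unipic_forward(image, task):
    # Route by task: generation and editing use the MAR encoder,
    # understanding uses SigLIP2; both paths feed one shared decoder.
    encoder = mar_encode if task in ("generate", "edit") else siglip2_encode
    return shared_decoder(encoder(image), task)
```

For example, `unipic_forward("img", "understand")` routes through the SigLIP2 stand-in, while `unipic_forward("img", "generate")` routes through the MAR stand-in; the point is that no task-specific adapter sits between encoder and decoder.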
Problem

Research questions and friction points this paper is trying to address.

Unifies image understanding and generation in one model
Achieves high performance with compact multimodal systems
Reduces GPU memory usage for high-resolution image generation
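A back-of-the-envelope check (my own arithmetic, not from the paper) shows why a 1.5 B-parameter model plausibly fits the reported <15 GB budget: half-precision weights alone take under 3 GB, leaving headroom for activations and caches.

```python
# Rough memory arithmetic; the byte width is an illustrative assumption
# (half-precision weights), not a statement about the released checkpoint.
PARAMS = 1.5e9      # parameter count stated in the abstract
BYTES_PER_PARAM = 2  # fp16/bf16 storage

weights_gb = PARAMS * BYTES_PER_PARAM / 1024**3  # ~2.8 GB of raw weights

# The remaining budget (out of the reported <15 GB on an RTX 4090)
# would cover activations, KV cache, and framework overhead.
print(f"weights: {weights_gb:.1f} GB")
```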
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified autoregressive model for multimodal tasks
Decoupled encoding strategy with shared decoder
Progressive resolution-aware training schedule
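The progressive, resolution-aware schedule with dynamic unfreezing can be sketched as a staged configuration: each stage raises the training resolution and widens the set of trainable parameter groups. The stage boundaries and group names below are assumptions for illustration, not the paper's actual recipe.

```python
# Illustrative training schedule (stages and unfrozen groups are assumed,
# not taken from the paper): resolution scales 256 -> 1024 while more
# parameter groups are progressively unfrozen.
STAGES = [
    {"resolution": 256,  "unfrozen": {"decoder"}},
    {"resolution": 512,  "unfrozen": {"decoder", "mar_encoder"}},
    {"resolution": 1024, "unfrozen": {"decoder", "mar_encoder", "siglip2_encoder"}},
]

ALL_GROUPS = ("decoder", "mar_encoder", "siglip2_encoder")

def trainable_groups(stage):
    """Return which parameter groups receive gradients at this stage."""
    return [g for g in ALL_GROUPS if g in stage["unfrozen"]]

for stage in STAGES:
    print(stage["resolution"], trainable_groups(stage))
```

Freezing most parameters at low resolution and unfreezing gradually is a common way to trade capacity against optimization stability, which matches the abstract's stated motivation.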
👥 Authors
Peiyu Wang (Skywork AI)
Yi Peng (Bytedance)
Yimeng Gan (Skywork AI)
Liang Hu (Skywork AI)
Tianyidan Xie (Skywork AI)
Xiaokun Wang (Nanjing University)
Yichen Wei (SHUKUN Technology)
Chuanxin Tang (Skywork AI)
Bo Zhu (Skywork AI)
Changshi Li (Skywork AI)
Hongyang Wei (Skywork AI)
Eric Li (Skywork AI)
Xuchen Song (Skywork AI)
Yang Liu (Skywork AI)
Yahui Zhou (Skywork AI)