🤖 AI Summary
This work introduces the first truly unified autoregressive multimodal model capable of simultaneously supporting image understanding, text-to-image generation, and image editing, without task-specific adapters or inter-module connectors. To address the challenge of unifying disparate multimodal tasks within a single architecture, the authors propose: (1) a decoupled multi-granularity encoding strategy integrating a masked autoregressive encoder with the SigLIP2 vision encoder; (2) progressive-resolution training coupled with dynamic parameter unfreezing; and (3) a 100 million-scale, reward-augmented multitask dataset. The model achieves high-resolution (1024×1024) generation on consumer-grade GPUs (<15 GB VRAM), attaining a GenEval score of 0.86 and a DPG-Bench score of 85.5 for complex generation, significantly outperforming existing unified architectures. It establishes a new Pareto-optimal trade-off between performance and deployment efficiency.
📝 Abstract
We introduce Skywork UniPic, a 1.5 billion-parameter autoregressive model that unifies image understanding, text-to-image generation, and image editing within a single architecture, eliminating the need for task-specific adapters or inter-module connectors, and demonstrate that compact multimodal systems can achieve state-of-the-art performance on commodity hardware. Skywork UniPic achieves a GenEval score of 0.86, surpassing most existing unified models; sets a new DPG-Bench complex-generation record of 85.5; attains 5.83 on GEditBench-EN and 3.49 on ImgEdit-Bench for image editing; and generates 1024×1024 images with under 15 GB of GPU memory (e.g., on an RTX 4090). Three key contributions enable these results: (1) a decoupled encoding strategy that leverages a masked autoregressive encoder for synthesis and a SigLIP2 encoder for understanding, both feeding a shared autoregressive decoder; (2) a progressive, resolution-aware training schedule scaling from 256×256 to 1024×1024 while dynamically unfreezing parameters to balance capacity and stability; and (3) meticulously curated, 100 million-scale datasets augmented with task-specific reward models to refine generation and editing objectives. By demonstrating that high-fidelity multimodal integration need not incur prohibitive resource demands, Skywork UniPic establishes a practical paradigm for deployable, high-fidelity multimodal AI. Code and weights are publicly available at https://huggingface.co/Skywork/Skywork-UniPic-1.5B.
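The decoupled encoding idea described in the abstract (a masked autoregressive encoder for synthesis, a SigLIP2 encoder for understanding, both feeding one shared autoregressive decoder) can be illustrated with a minimal routing sketch. This is purely hypothetical scaffolding, not the authors' implementation: the function names, the task labels, and the stub "encoders" are assumptions made for illustration only.

```python
# Hypothetical sketch of decoupled multi-granularity encoding:
# two task-specific encoders produce token streams that a single
# shared autoregressive decoder consumes. Encoder internals are
# stubbed out; only the routing structure mirrors the description.

def mar_encode(image):
    # Stand-in for the masked autoregressive encoder (synthesis path).
    return [("mar", patch) for patch in image]

def siglip2_encode(image):
    # Stand-in for the SigLIP2 vision encoder (understanding path).
    return [("siglip2", patch) for patch in image]

def shared_decoder(tokens, prompt):
    # One decoder handles either token stream -- no adapters or
    # inter-module connectors between encoder and decoder.
    source = tokens[0][0] if tokens else "none"
    return f"{prompt}: {len(tokens)} tokens via {source}"

def unified_forward(task, image, prompt):
    # Route to the synthesis encoder for generation/editing tasks,
    # and to the understanding encoder otherwise.
    encode = mar_encode if task in ("generate", "edit") else siglip2_encode
    return shared_decoder(encode(image), prompt)
```

Usage: `unified_forward("generate", [1, 2], "a red cube")` routes patches through the synthesis stub, while `unified_forward("understand", ...)` takes the SigLIP2 path; both end in the same decoder, which is the point the abstract makes about avoiding task-specific modules.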