DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

📅 2026-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes DeepGen 1.0, a lightweight 5-billion-parameter unified model for multimodal image generation and editing that significantly reduces the computational cost typically associated with large-scale models exceeding 10 billion parameters. DeepGen 1.0 introduces a novel Stacked Channel Bridging mechanism that deeply integrates multi-level features from vision-language models with learnable "think tokens," enabling fine-grained semantic control. The model is trained via a three-stage strategy of alignment pre-training, joint supervised fine-tuning, and MR-GRPO reinforcement learning, achieving strong performance despite using only ~50 million training samples. DeepGen 1.0 outperforms the 80B-parameter HunyuanImage by 28% on the WISE benchmark and surpasses the 27B-parameter Qwen-Image-Edit by 37% on UniREditBench. The code, model weights, and dataset are publicly released.

📝 Abstract
Current unified multimodal models for image generation and editing typically rely on massive parameter scales (e.g., >10B), entailing prohibitive training costs and deployment footprints. In this work, we present DeepGen 1.0, a lightweight 5B unified model that achieves comprehensive capabilities competitive with or surpassing much larger counterparts. To overcome the limitations of compact models in semantic understanding and fine-grained control, we introduce Stacked Channel Bridging (SCB), a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable 'think tokens' to provide the generative backbone with structured, reasoning-rich guidance. We further design a data-centric training strategy spanning three progressive stages: (1) Alignment Pre-training on large-scale image-text pairs and editing triplets to synchronize VLM and DiT representations, (2) Joint Supervised Fine-tuning on a high-quality mixture of generation, editing, and reasoning tasks to foster omni-capabilities, and (3) Reinforcement Learning with MR-GRPO, which leverages a mixture of reward functions and supervision signals, resulting in substantial gains in generation quality and alignment with human preferences, while maintaining stable training progress and avoiding visual artifacts. Despite being trained on only ~50M samples, DeepGen 1.0 achieves leading performance across diverse benchmarks, surpassing the 80B HunyuanImage by 28% on WISE and the 27B Qwen-Image-Edit by 37% on UniREditBench. By open-sourcing our training code, weights, and datasets, we provide an efficient, high-performance alternative to democratize unified multimodal research.
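The abstract's description of Stacked Channel Bridging can be sketched as follows. This is an illustrative reading only: the paper does not publish the mechanism's internals here, so every dimension, layer count, and module name below (`vlm_dim`, `num_layers_tapped`, `StackedChannelBridge`, etc.) is a hypothetical choice, not DeepGen 1.0's actual implementation. The sketch assumes features tapped from several VLM layers are projected, stacked along the channel axis, fused, and prepended with learnable think tokens before conditioning the DiT backbone.

```python
# Hedged sketch of Stacked Channel Bridging (SCB) as described in the
# abstract. All shapes and names are illustrative assumptions.
import torch
import torch.nn as nn


class StackedChannelBridge(nn.Module):
    def __init__(self, vlm_dim=1024, dit_dim=768,
                 num_layers_tapped=4, num_think_tokens=16):
        super().__init__()
        # One projection per tapped VLM layer; projected features are
        # stacked on the channel axis (one reading of "stacked channel").
        self.proj = nn.ModuleList(
            nn.Linear(vlm_dim, dit_dim) for _ in range(num_layers_tapped)
        )
        self.fuse = nn.Linear(num_layers_tapped * dit_dim, dit_dim)
        # Learnable "think tokens" prepended to the fused guidance sequence.
        self.think_tokens = nn.Parameter(
            torch.randn(num_think_tokens, dit_dim) * 0.02
        )

    def forward(self, vlm_hidden_states):
        # vlm_hidden_states: list of [batch, seq, vlm_dim], one per tapped layer.
        projected = [p(h) for p, h in zip(self.proj, vlm_hidden_states)]
        stacked = torch.cat(projected, dim=-1)      # [B, S, L * dit_dim]
        guidance = self.fuse(stacked)               # [B, S, dit_dim]
        think = self.think_tokens.expand(guidance.size(0), -1, -1)
        # Guidance sequence handed to the DiT backbone as conditioning.
        return torch.cat([think, guidance], dim=1)  # [B, T + S, dit_dim]


bridge = StackedChannelBridge()
feats = [torch.randn(2, 32, 1024) for _ in range(4)]
out = bridge(feats)
print(tuple(out.shape))  # (2, 48, 768): 16 think tokens + 32 guidance tokens
```

The design point the abstract emphasizes is that a compact 5B model gains semantic control by reusing the VLM's hierarchy (multiple layers) rather than a single final-layer embedding; the think tokens add a learnable slot for reasoning-rich guidance.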
Problem

Research questions and friction points this paper is trying to address.

multimodal models
image generation
image editing
model scale
training cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stacked Channel Bridging
lightweight unified multimodal model
MR-GRPO reinforcement learning
think tokens
data-centric training strategy
👥 Authors
Dianyi Wang (Fudan University & Shanghai Innovation Institute): Multi-modal Learning
Ruihang Li (Shanghai Innovation Institute, University of Science and Technology of China)
Feng Han (Fudan University): Trustworthy AI, MLLM
Chaofan Ma (Shanghai Innovation Institute, Shanghai Jiao Tong University)
Wei Song (Zhejiang University, Westlake University, Shanghai Innovation Institute): Artificial Intelligence, Multi-modal Learning, MLLMs
Siyuan Wang (University of Southern California): Machine Reasoning, Natural Language Processing
Yibin Wang (Intern at UIUC): Trustworthy AI
Yi Xin (California Institute of Technology): Industrial Organization, Econometrics
Hongjian Liu (Anhui Polytechnic University): Complex Networks, Neural Networks
Zhixiong Zhang (Shanghai Jiao Tong University): Computer Vision, Multi-modal Learning
Shengyuan Ding (Fudan University): Multimodal Learning
Tianhang Wang (Shanghai Innovation Institute, Zhejiang University)
Zhenglin Cheng (Zhejiang University & Westlake University, SII): Multimodal Learning, Diffusion Models
Tao Lin (Westlake University)
Cheng Jin (Fudan University): Image and Video Processing, Computer Vision, HCI
Kaicheng Yu (Assistant Professor, Westlake University, PI of Autonomous Intelligence Lab): Computer Vision, 3D Understanding, Autonomous Perception, Automatic Machine Learning
Jingjing Chen (Fudan University): Multimedia, Computer Vision, Machine Learning, Pattern Recognition
Wenjie Wang (University of Science and Technology of China)
Zhongyu Wei (Shanghai Innovation Institute, Fudan University)
Jiaqi Wang (Shanghai AI Laboratory): Computer Vision, Multi-modal Learning