UniRL-Zero: Reinforcement Learning on Unified Models with Joint Language Model and Diffusion Model Experts

📅 2025-10-20
🤖 AI Summary
This study addresses three key challenges: weak reasoning capabilities in multimodal language models (MLLMs), low controllability in diffusion models (DMs), and the absence of a synergistic optimization mechanism between them. To this end, we propose UniRL-Zero, the first unified reinforcement learning (RL) framework for joint optimization of understanding and generation. Methodologically, we design six cross-modal RL scenarios and establish bidirectional reward signals linking language understanding and visual generation, enabling end-to-end co-training of MLLM and DM experts. Our contributions are threefold: (1) introducing the first benchmark for understanding-generation joint RL; (2) achieving significant improvements in cross-modal reasoning and controllable generation across diverse multimodal tasks; and (3) open-sourcing the codebase and training protocols to advance research in interactive multimodal learning.

📝 Abstract
We present UniRL-Zero, a unified reinforcement learning (RL) framework that boosts multimodal language model understanding and reasoning, diffusion model multimedia generation, and their beneficial interaction capabilities within a unified model. Our work defines six scenarios for unified model reinforcement learning, providing systematic baselines for reinforcement learning of unified understanding and generation models. Our code is available at https://github.com/G-U-N/UniRL.

Problem

Research questions and friction points this paper is trying to address.

Enhancing multimodal language model understanding and reasoning capabilities
Improving diffusion model multimedia generation quality
Facilitating beneficial interactions between understanding and generation models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified reinforcement learning framework for multimodal models
Integrates language model reasoning with diffusion generation
Defines six scenarios for systematic RL baselines
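
The summary does not say which RL algorithm or reward design the framework uses, so the following is only a generic sketch of one step that policy-gradient fine-tuning often includes: normalizing rewards within a group of samples drawn for the same prompt. The function name `group_relative_advantages`, the `eps` parameter, and the reward values are all hypothetical and are not taken from the paper or its codebase. The same normalization could in principle be applied to text responses from a language model expert or to images from a diffusion expert.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Center and scale the rewards of samples drawn for one prompt.

    Assumption: this is a common generic normalization, not the
    paper's confirmed method.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps keeps the division finite when every sample gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled text answers (e.g. 1.0 = correct)
text_rewards = [1.0, 0.0, 0.0, 1.0]
# Hypothetical rewards for four sampled images (e.g. a preference score)
image_rewards = [0.2, 0.4, 0.6, 0.8]

text_adv = group_relative_advantages(text_rewards)
image_adv = group_relative_advantages(image_rewards)
```

Samples that score above their group mean receive a positive advantage and those below receive a negative one, which lets a single update rule weight samples from either expert without a separately learned value function.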

Authors

Fu-Yun Wang (Ph.D. candidate, Chinese University of Hong Kong; machine learning, computer vision)
Han Zhang
Michael Gharbi
Hongsheng Li (EE, The Chinese University of Hong Kong)
Taesung Park