🤖 AI Summary
This study addresses three key challenges: weak reasoning capabilities in multimodal language models (MLLMs), low controllability in diffusion models (DMs), and the absence of a synergistic optimization mechanism between them. To this end, we propose UniRL-Zero, the first unified reinforcement learning (RL) framework for the joint optimization of understanding and generation. Methodologically, we design six cross-modal RL scenarios and establish bidirectional reward signals linking language understanding and visual generation, enabling end-to-end co-training of MLLM and DM experts. Our contributions are threefold: (1) introducing the first benchmark for understanding-generation joint RL; (2) achieving significant improvements in cross-modal reasoning and controllable generation across diverse multimodal tasks; and (3) open-sourcing the codebase and training protocols to advance research in interactive multimodal learning.
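The bidirectional reward idea above can be sketched in a few lines. The functions below are hypothetical illustrations (not the UniRL-Zero implementation): a toy understanding-side reward where the MLLM scores how well a generation covers a caption, a toy generation-side reward where grounded outputs check the MLLM's answer, and a weighted combination so each expert is also credited for the other's success.

```python
# Hypothetical sketch of bidirectional rewards for joint MLLM/DM RL training.
# All function names and the reward definitions are illustrative assumptions,
# not the actual UniRL-Zero reward design.

def mllm_reward_for_generation(caption: str, generated_tokens: list[str]) -> float:
    """Understanding-side reward: fraction of caption words the generation covers."""
    caption_words = set(caption.lower().split())
    covered = sum(1 for t in set(generated_tokens) if t.lower() in caption_words)
    return covered / max(len(caption_words), 1)

def dm_reward_for_understanding(answer: str, reference: str) -> float:
    """Generation-side reward: exact-match check on the MLLM's answer."""
    return 1.0 if answer.strip().lower() == reference.strip().lower() else 0.0

def joint_reward(r_understand: float, r_generate: float, alpha: float = 0.5) -> float:
    """Bidirectional signal: a convex combination credits both experts jointly."""
    return alpha * r_understand + (1.0 - alpha) * r_generate
```

In an actual RL loop, `joint_reward` would weight policy-gradient updates for both the MLLM and the DM, so improvements on either side raise the shared training signal.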
📝 Abstract
We present UniRL-Zero, a unified reinforcement learning (RL) framework that boosts multimodal language model understanding and reasoning, diffusion model multimedia generation, and their beneficial interaction within a unified model. Our work defines six scenarios for unified-model reinforcement learning, providing systematic baselines for RL on unified understanding and generation models. Our code is available at https://github.com/G-U-N/UniRL.