🤖 AI Summary
Existing multimodal large language models (MLLMs) suffer from a fundamental decoupling between visual understanding and generation capabilities, which hinders their co-evolution and leaves them without interpretable, self-reflective generation mechanisms. To address this, we propose a novel understanding-generation co-evolution paradigm: modeling image generation as an iterative, chain-of-thought (CoT)-guided self-reflection process, and, crucially, introducing reinforcement learning (RL) into visual generation for the first time to elicit "aha moments." Our method employs a two-stage training framework: supervised fine-tuning to bootstrap CoT-based reasoning capability, followed by Proximal Policy Optimization (PPO) to balance exploration and exploitation. Furthermore, we integrate the MLLM with a CoT-driven cross-modal alignment mechanism. Evaluated on text-to-image generation and image editing tasks, our approach achieves state-of-the-art performance while substantially improving semantic image assessment and visual understanding, unifying enhanced controllability, interpretability, and comprehension in generative vision systems.
📝 Abstract
Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation. However, these two capabilities remain largely independent, as if they were two separate functions encapsulated within the same model. Consequently, visual comprehension does not enhance visual generation, and the reasoning mechanisms of LLMs have not been fully integrated to revolutionize image generation. In this paper, we propose to enable the collaborative co-evolution of visual comprehension and generation, advancing image generation into an iterative introspective process. We introduce a two-stage training approach: supervised fine-tuning equips the MLLM with the foundational ability to generate genuine chain-of-thought (CoT) for visual generation, while reinforcement learning activates its full potential via an exploration-exploitation trade-off. Ultimately, we unlock the "aha moment" in visual generation, advancing MLLMs from text-to-image tasks to unified image generation. Extensive experiments demonstrate that our model not only excels in text-to-image generation and image editing, but also functions as a superior image semantic evaluator with enhanced visual comprehension capabilities. Project Page: https://janus-pro-r1.github.io.
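To make the "iterative introspective process" concrete, the loop below is a minimal toy sketch of generate-critique-regenerate with CoT-style feedback. The stub functions `generate_image` and `reflect` are hypothetical stand-ins invented for illustration; the paper's actual MLLM generator, critique format, and RL-trained policy are not reproduced here.

```python
from typing import Optional, Tuple

def generate_image(prompt: str, feedback: Optional[str] = None) -> str:
    """Stand-in for the MLLM's image decoder: returns a text token
    representing an image, optionally conditioned on critique feedback."""
    if feedback:
        return f"image({prompt}|fix:{feedback})"
    return f"image({prompt})"

def reflect(prompt: str, image: str) -> Tuple[float, str]:
    """Stand-in for CoT-based self-assessment: returns a quality score
    and a textual critique. Toy rule: a regenerated image that already
    incorporates a fix is considered acceptable."""
    score = 1.0 if "fix:" in image else 0.4
    critique = "ok" if score >= 0.9 else "prompt-image mismatch; re-render"
    return score, critique

def self_reflective_generation(prompt: str, max_rounds: int = 3,
                               threshold: float = 0.9) -> str:
    """Generate, self-critique, and regenerate until the internal
    evaluator is satisfied or the round budget runs out."""
    image = generate_image(prompt)
    for _ in range(max_rounds):
        score, critique = reflect(prompt, image)
        if score >= threshold:
            break
        image = generate_image(prompt, feedback=critique)
    return image
```

In the paper's framing, supervised fine-tuning would teach the model to produce the critique step at all, and RL (PPO) would then reward loops whose final image scores well, trading off exploring new critiques against exploiting known good ones.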