MM-R1: Unleashing the Power of Unified Multimodal Large Language Models for Personalized Image Generation

📅 2025-08-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the poor scalability of unified-architecture multimodal large language models (MLLMs) in personalized image generation, which stems from their reliance on subject-specific fine-tuning, this paper proposes MM-R1, a framework built around cross-modal Chain-of-Thought (X-CoT) reasoning that unifies visual concept understanding and generative synthesis in a zero-shot paradigm. X-CoT explicitly models the semantic-to-visual mapping between textual subjects and their visual representations, enabling subject-customized generation without any parameter updates. The framework further adopts Grouped Reward Proximal Policy Optimization (GRPO), a reinforcement learning strategy that jointly optimizes subject fidelity and text–image alignment. Experiments show that MM-R1 significantly outperforms existing baselines under zero-shot settings, with reported absolute improvements of +12.7% in Subject Consistency and +9.3% in Text–Image Alignment, effectively unlocking the potential of unified MLLMs for open-domain, zero-shot personalized image generation.

📝 Abstract
Multimodal Large Language Models (MLLMs) with unified architectures excel across a wide range of vision-language tasks, yet aligning them with personalized image generation remains a significant challenge. Existing methods for MLLMs are frequently subject-specific, demanding a data-intensive fine-tuning process for every new subject, which limits their scalability. In this paper, we introduce MM-R1, a framework that integrates a cross-modal Chain-of-Thought (X-CoT) reasoning strategy to unlock the inherent potential of unified MLLMs for personalized image generation. Specifically, we structure personalization as an integrated visual reasoning and generation process: (1) grounding subject concepts by interpreting and understanding user-provided images and contextual cues, and (2) generating personalized images conditioned on both the extracted subject representations and user prompts. To further enhance the reasoning capability, we adopt Grouped Reward Proximal Policy Optimization (GRPO) to explicitly align the generation. Experiments demonstrate that MM-R1 unleashes the personalization capability of unified MLLMs to generate images with high subject fidelity and strong text alignment in a zero-shot manner.
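The abstract describes GRPO as scoring generations jointly on subject fidelity and text alignment, with updates driven by rewards normalized within a group of candidates. A minimal, illustrative sketch of that grouped-reward idea follows; the reward terms, the 0.5 weighting, and the normalization details are assumptions for illustration, not the paper's actual implementation.

```python
# Illustrative sketch of a grouped-reward advantage computation in the
# spirit of GRPO as described in the abstract. All specifics (reward
# components, weighting, normalization) are assumptions, not MM-R1's code.

from statistics import mean, stdev

def combined_reward(subject_score, alignment_score, w=0.5):
    """Jointly score subject fidelity and text-image alignment,
    the two objectives the abstract says GRPO optimizes."""
    return w * subject_score + (1 - w) * alignment_score

def grouped_advantages(rewards):
    """Normalize each candidate's reward against its group's statistics,
    so the update signal is relative to sibling generations."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 1.0
    sigma = sigma or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]

# Example: four candidate generations sampled for one prompt, each with
# hypothetical (subject_fidelity, text_alignment) scores in [0, 1].
rewards = [combined_reward(s, a) for s, a in
           [(0.9, 0.8), (0.4, 0.7), (0.6, 0.3), (0.8, 0.9)]]
advantages = grouped_advantages(rewards)
# Candidates above the group mean get positive advantages and are
# reinforced; below-mean candidates are suppressed.
```

Normalizing within the group means no external value network is needed: each sample is judged only relative to its siblings from the same prompt.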
Problem

Research questions and friction points this paper is trying to address.

Aligning unified MLLMs with personalized image generation
Overcoming data-intensive fine-tuning for new subjects
Enhancing subject fidelity and text alignment in generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified MLLMs with X-CoT reasoning
Visual reasoning and generation integration
GRPO for enhanced generation alignment
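The two-stage X-CoT process outlined above (ground the subject concept from reference images, then generate conditioned on that representation plus the prompt) can be sketched as a control flow. Every class, function, and return value here is invented for illustration; the summary does not expose the paper's actual interfaces.

```python
# Hypothetical sketch of the two-stage X-CoT flow: (1) ground a subject
# concept from user images and context, (2) generate conditioned on the
# grounded concept and the prompt, with model parameters frozen (zero-shot).
# All names and outputs are mocked for illustration.

from dataclasses import dataclass, field

@dataclass
class SubjectConcept:
    name: str
    attributes: list = field(default_factory=list)  # inferred visual traits

def ground_subject(reference_images, context):
    """Stage 1: interpret reference images and contextual cues to extract
    a reusable subject representation (no fine-tuning involved)."""
    # A real unified MLLM would reason over pixels; we mock the result.
    return SubjectConcept(name=context["subject"],
                          attributes=["inferred", "attributes"])

def generate_personalized(concept, prompt):
    """Stage 2: condition generation on the grounded concept and the
    user prompt, keeping the model's parameters frozen."""
    return f"<image of {concept.name} | prompt: {prompt}>"

concept = ground_subject(["ref1.png"], {"subject": "my corgi"})
image = generate_personalized(concept, "wearing a red scarf on a beach")
```

The key property this structure captures is that personalization happens entirely at inference time: the subject representation is an intermediate reasoning artifact, not a set of tuned weights.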
Qian Liang
University of Electronic Science and Technology of China
Yujia Wu
University of Electronic Science and Technology of China
Kuncheng Li
University of Electronic Science and Technology of China
Jiwei Wei
Professor at University of Electronic Science and Technology of China (UESTC)
Cross-Modal Retrieval, Metric Learning, Adversarial Machine Learning, AIGC
Shiyuan He
University of Electronic Science and Technology of China
Jinyu Guo
University of Electronic Science and Technology of China
Natural Language Processing
Ning Xie
University of Electronic Science and Technology of China