From Competition to Synergy: Unlocking Reinforcement Learning for Subject-Driven Image Generation

📅 2025-10-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Subject-driven image generation suffers from an inherent conflict between identity fidelity and text prompt adherence. To address this, we propose Customized-GRPO, a reinforcement learning framework that mitigates conflicting optimization signals via Synergy-Aware Reward Shaping, a non-linear mechanism that amplifies synergistic effects between fidelity and alignment. We further introduce a Time-Aware Dynamic Weighting strategy that adaptively modulates the trade-off weights according to the temporal characteristics of the diffusion model's denoising steps. Built upon the GRPO framework, our method enables online, end-to-end co-optimization. Extensive experiments on diverse editing tasks demonstrate significant improvements over baselines: competitive degradation is effectively alleviated, identity preservation increases by 12.6%, and prompt adherence (measured by CLIP-Score) improves by 9.3%. To our knowledge, this is the first approach to achieve high-fidelity, prompt-aligned subject-driven generation within a unified optimization framework.

📝 Abstract
Subject-driven image generation models face a fundamental trade-off between identity preservation (fidelity) and prompt adherence (editability). While online reinforcement learning (RL), specifically GRPO, offers a promising solution, we find that a naive application of GRPO leads to competitive degradation: the simple linear aggregation of rewards with static weights causes conflicting gradient signals and a misalignment with the temporal dynamics of the diffusion process. To overcome these limitations, we propose Customized-GRPO, a novel framework featuring two key innovations: (i) Synergy-Aware Reward Shaping (SARS), a non-linear mechanism that explicitly penalizes conflicted reward signals and amplifies synergistic ones, providing a sharper and more decisive gradient; and (ii) Time-Aware Dynamic Weighting (TDW), which aligns the optimization pressure with the model's temporal dynamics by prioritizing prompt-following in the early denoising steps and identity preservation in the later ones. Extensive experiments demonstrate that our method significantly outperforms naive GRPO baselines, successfully mitigating competitive degradation. Our model achieves a superior balance, generating images that both preserve key identity features and accurately adhere to complex textual prompts.
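As a rough illustration of the non-linear shaping idea behind SARS, the sketch below combines a (normalized) identity reward and a prompt-adherence reward, boosting the aggregate when both improve together and damping it when they conflict. The function name, branch conditions, and coefficients are illustrative assumptions, not the paper's actual formulation:

```python
def synergy_aware_reward(r_id: float, r_text: float,
                         bonus: float = 0.5, penalty: float = 0.5) -> float:
    """Hypothetical SARS-style shaping of two reward signals.

    r_id:   identity-preservation reward (e.g. a normalized advantage)
    r_text: prompt-adherence reward (e.g. CLIP-Score advantage)
    """
    base = r_id + r_text
    if r_id > 0 and r_text > 0:
        # Synergy: both objectives improve -> amplify the signal.
        return base * (1.0 + bonus)
    if r_id * r_text < 0:
        # Conflict: one objective improves at the other's expense -> damp it.
        return base * (1.0 - penalty)
    return base
```

Compared with a static linear sum `r_id + r_text`, this kind of shaping gives conflicted samples a smaller gradient contribution and synergistic samples a larger one, which is the "sharper and more decisive gradient" the abstract describes.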
Problem

Research questions and friction points this paper is trying to address.

Addresses fidelity-editability trade-off in subject-driven image generation
Mitigates competitive degradation from conflicting RL reward signals
Aligns optimization with temporal dynamics of diffusion process
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synergy-Aware Reward Shaping penalizes conflicts and amplifies synergies
Time-Aware Dynamic Weighting aligns optimization with temporal dynamics
Customized-GRPO framework mitigates competitive degradation in RL training
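The TDW idea of shifting optimization pressure across denoising steps could be sketched as a simple time-dependent weight schedule. The linear schedule and names below are illustrative assumptions; the paper's actual weighting function is not specified here:

```python
def time_aware_weights(t: int, total_steps: int) -> tuple[float, float]:
    """Hypothetical TDW-style schedule over denoising steps.

    Early (high-noise) steps weight prompt adherence; later steps
    shift weight toward identity preservation. Returns (w_id, w_text).
    """
    progress = t / total_steps     # 0.0 at the first step, 1.0 at the last
    w_text = 1.0 - progress        # prompt-following dominates early
    w_id = progress                # identity preservation dominates late
    return w_id, w_text

def weighted_reward(r_id: float, r_text: float, t: int, total_steps: int) -> float:
    """Combine the two rewards with the step-dependent weights."""
    w_id, w_text = time_aware_weights(t, total_steps)
    return w_id * r_id + w_text * r_text
```

With static weights, every denoising step feels the same fidelity/editability pressure; a schedule like this instead lets the optimizer match each reward to the phase of the diffusion process where that objective is actually decided.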