🤖 AI Summary
In reinforcement learning (RL) fine-tuning of generative models, fixed divergence regularization struggles to balance exploration and exploitation: strong regularization hinders reward optimization, while weak regularization risks instability or reward hacking. To address this, we propose Adaptive Divergence Regularized Policy Optimization (ADRPO), which modulates regularization strength based on advantage estimates, for both Wasserstein-2 and KL divergences: high-advantage samples receive weaker regularization to allow aggressive exploitation, while low-advantage samples receive stronger regularization for stable constraint enforcement. ADRPO works with flow-matching generative models and generalizes to large language models (LLMs) and multimodal reasoning models. In experiments, a 2B-parameter SD3 model fine-tuned with ADRPO surpasses 4.8B- and 12B-parameter baselines in text-to-image generation, with gains in semantic alignment, attribute binding, compositional control, and generation diversity, and a 7B-parameter model outperforms Gemini 2.5 Pro and GPT-4o Audio on multimodal audio reasoning.
📝 Abstract
Balancing exploration and exploitation during reinforcement learning fine-tuning of generative models presents a critical challenge: existing approaches rely on fixed divergence regularization, which creates an inherent dilemma, since strong regularization preserves model capabilities but limits reward optimization, while weak regularization enables greater alignment but risks instability or reward hacking. We introduce Adaptive Divergence Regularized Policy Optimization (ADRPO), which automatically adjusts regularization strength based on advantage estimates, reducing regularization for high-value samples while applying stronger regularization to poor samples, enabling policies to navigate between exploration and aggressive exploitation according to data quality. Our implementation with Wasserstein-2 regularization for flow-matching generative models achieves remarkable results on text-to-image generation, attaining better semantic alignment and diversity than offline methods such as DPO and online methods with fixed regularization such as ORW-CFM-W2. ADRPO enables a 2B-parameter SD3 model to surpass much larger 4.8B- and 12B-parameter models in attribute binding, semantic consistency, artistic style transfer, and compositional control while maintaining generation diversity. ADRPO also generalizes to KL-regularized fine-tuning of both text-only LLMs and multi-modal reasoning models, enhancing existing online RL methods such as GRPO. In LLM fine-tuning, ADRPO demonstrates an emergent ability to escape local optima through active exploration, and in multi-modal audio reasoning it outperforms GRPO through superior step-by-step reasoning, enabling a 7B model to surpass substantially larger commercial models including Gemini 2.5 Pro and GPT-4o Audio. ADRPO thus offers an effective plug-and-play solution to the exploration-exploitation challenge across diverse generative architectures and modalities.
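To make the adaptive-regularization idea concrete, below is a minimal PyTorch sketch of a KL-regularized policy-gradient loss whose regularization coefficient is modulated by the advantage estimate, so that high-advantage samples are penalized less and low-advantage samples more. The function name, the sigmoid modulation, and all hyperparameters are illustrative assumptions, not the paper's exact formulation (which also covers a Wasserstein-2 variant for flow matching).

```python
import torch

def adaptive_kl_pg_loss(logp_new, logp_ref, advantages,
                        base_coef=0.1, temperature=1.0):
    """Hypothetical sketch of an advantage-modulated, KL-regularized
    policy-gradient loss; the paper's exact modulation may differ."""
    # Per-sample log-ratio against the frozen reference policy
    # (a standard surrogate for the KL penalty in RL fine-tuning).
    log_ratio = logp_new - logp_ref.detach()

    # Modulate the regularization coefficient by the advantage:
    # high-advantage samples get a weaker penalty (aggressive exploitation),
    # low-advantage samples get a stronger one (stay near the reference).
    coef = base_coef * torch.sigmoid(-advantages / temperature)

    # Policy-gradient term plus the adaptive divergence penalty.
    pg_term = -advantages.detach() * logp_new
    return (pg_term + coef.detach() * log_ratio).mean()

# Illustrative usage with dummy tensors.
logp_new = torch.randn(8, requires_grad=True)
logp_ref = torch.randn(8)
advantages = torch.randn(8)
loss = adaptive_kl_pg_loss(logp_new, logp_ref, advantages)
loss.backward()
```

The key design point this sketch illustrates is that the divergence penalty is no longer a single fixed coefficient but a per-sample weight driven by the advantage signal, which is what lets the policy exploit high-quality samples aggressively while staying anchored to the reference on poor ones.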