🤖 AI Summary
Existing text-guided image editing methods rely on manual tuning of multiple coupled hyperparameters—such as inversion timesteps and attention modifications—resulting in a large search space and high computational cost.
Method: We propose the first reinforcement learning–based framework for automatic hyperparameter optimization, formulating hyperparameter tuning as a sequential decision-making problem within the diffusion denoising process. We model it as a Markov Decision Process (MDP) to enable adaptive, timestep-aware adjustment, and employ Proximal Policy Optimization (PPO) with a composite reward function that jointly optimizes editing fidelity and target alignment.
Results: Experiments demonstrate that our method significantly reduces search overhead (5.2× speedup on average), improves editing quality (18.7% lower FID), and enhances controllability without compromising generation diversity. This establishes an efficient, robust paradigm for hyperparameter optimization in practical diffusion model deployment.
📝 Abstract
Recent advances in diffusion models have revolutionized text-guided image editing, yet existing editing methods face critical challenges in hyperparameter identification. To get the reasonable editing performance, these methods often require the user to brute-force tune multiple interdependent hyperparameters, such as inversion timesteps and attention modification, extit{etc.} This process incurs high computational costs due to the huge hyperparameter search space. We consider searching optimal editing's hyperparameters as a sequential decision-making task within the diffusion denoising process. Specifically, we propose a reinforcement learning framework, which establishes a Markov Decision Process that dynamically adjusts hyperparameters across denoising steps, integrating editing objectives into a reward function. The method achieves time efficiency through proximal policy optimization while maintaining optimal hyperparameter configurations. Experiments demonstrate significant reduction in search time and computational overhead compared to existing brute-force approaches, advancing the practical deployment of a diffusion-based image editing framework in the real world.