🤖 AI Summary
Existing image restoration methods rely on pixel-wise hard fitting, often resulting in over-smoothing and poor generalization. To address this, we propose IRPO, the first post-training optimization framework adapting GRPO (Group Relative Policy Optimization) to low-level vision. Its core contributions are: (1) a task-aware data filtering criterion for image restoration that jointly considers structural fidelity, perceptual alignment, and task-specific characteristics; and (2) a triple-reward system integrating a generic reward, a Qwen-VL–driven expert perceptual reward, and a task-specific restoration reward, enabling joint pixel-level and perceptual-level optimization. IRPO achieves state-of-the-art performance on six in-domain and five out-of-domain benchmarks. Compared to AdaIR, it improves in-domain PSNR by 0.83 dB and out-of-domain PSNR by up to 3.43 dB, demonstrating significantly enhanced generalization capability and restoration quality.
📝 Abstract
Recent advances in post-training paradigms have achieved remarkable success in high-level generation tasks, yet their potential for low-level vision remains largely unexplored. Existing image restoration (IR) methods rely on pixel-level hard fitting to ground-truth images and thus struggle with over-smoothing and poor generalization. To address these limitations, we propose IRPO, a low-level GRPO-based post-training paradigm that systematically explores both data formulation and reward modeling. We first establish a data formulation principle for the low-level post-training paradigm, in which selecting underperforming samples from the pre-training stage yields optimal performance and improved efficiency. Furthermore, we design a reward criteria system that balances objective accuracy and human perceptual preference through three complementary components: a General Reward for structural fidelity, an Expert Reward leveraging Qwen-VL for perceptual alignment, and a Restoration Reward for task-specific low-level quality. Comprehensive experiments on six in-domain and five out-of-domain (OOD) low-level benchmarks demonstrate that IRPO achieves state-of-the-art results across diverse degradation types, surpassing the AdaIR baseline by 0.83 dB on in-domain tasks and 3.43 dB in OOD settings. Our code is available at https://github.com/HaoxuanXU1024/IRPO.
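To make the reward modeling concrete, the sketch below shows one plausible way the three reward components could be combined and fed into GRPO's group-relative advantage computation. The weighting scheme, function names, and numeric values are illustrative assumptions, not details from the paper; GRPO's normalization of each candidate's reward against its sampling group is the standard formulation.

```python
# Hedged sketch (assumptions, not the authors' implementation):
# combine IRPO's three reward components with illustrative weights,
# then compute GRPO-style group-relative advantages.

def combined_reward(general, expert, restoration, w=(1.0, 1.0, 1.0)):
    """Weighted sum of the General, Expert, and Restoration rewards.
    The equal default weights are an assumption for illustration."""
    return w[0] * general + w[1] * expert + w[2] * restoration

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO scores each sampled output relative to its group:
    advantage_i = (r_i - mean(group)) / (std(group) + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: rewards for a group of 4 candidate restorations of one input
# (component scores are made up for illustration).
group = [
    combined_reward(0.8, 0.6, 0.7),
    combined_reward(0.5, 0.4, 0.6),
    combined_reward(0.9, 0.7, 0.8),
    combined_reward(0.4, 0.5, 0.5),
]
adv = group_relative_advantages(group)
```

Normalizing within the group means candidates are rewarded for being better than their siblings rather than for hitting an absolute score, which is what lets GRPO dispense with a learned value critic.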