AI Summary
This work addresses the limitations of existing photo retouching methods, which rely on non-differentiable external tools, leading to optimization difficulties, parameter redundancy, and poor generalization. To overcome these issues, we propose a lightweight, end-to-end differentiable retouching framework that leverages a 0.5B-parameter vision-language model to interpret both image defects and semantic editing instructions. A fully differentiable Retouch Renderer enables pixel-level training, while decoupled control latent variables and inverse degradation-based data synthesis enhance model generalization. Our contributions include AetherRetouch-1M+, the first million-scale professional retouching dataset, the differentiable renderer itself, and DAPO-AE, a reinforcement learning-based post-training strategy. The proposed method achieves state-of-the-art performance across multiple benchmarks, with significantly reduced model size, enabling efficient multi-task inference and mobile deployment.
Abstract
Reasoning photo retouching has gained significant traction, requiring models to analyze image defects, explain their reasoning, and execute precise retouching adjustments. However, existing approaches often rely on non-differentiable external software, which creates optimization barriers, and they suffer from high parameter redundancy and limited generalization. To address these challenges, we propose VeraRetouch, a lightweight and fully differentiable framework for multi-task photo retouching. We employ a 0.5B Vision-Language Model (VLM) as the central intelligence to formulate retouching plans based on instructions and scene semantics. We also develop a fully differentiable Retouch Renderer that replaces external tools, enabling direct end-to-end pixel-level training through decoupled control latents for lighting, global color, and specific color adjustments. To overcome data scarcity, we introduce AetherRetouch-1M+, the first million-scale dataset for professional retouching, constructed via a new inverse degradation workflow. Finally, we propose DAPO-AE, a reinforcement learning post-training strategy that enhances autonomous aesthetic cognition. Extensive experiments demonstrate that VeraRetouch achieves state-of-the-art performance across multiple benchmarks while maintaining a significantly smaller footprint, enabling mobile deployment. Our code and models are publicly available at https://github.com/OpenVeraTeam/VeraRetouch.
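
To illustrate why a differentiable renderer permits pixel-level training, the following is a minimal toy sketch, not the VeraRetouch renderer. It assumes a hypothetical two-part control: a single exposure value (lighting) and three per-channel gains (global color). The function names `render`, `loss_and_grads`, and `fit` are invented for this illustration, and the parameters are optimized directly by gradient descent rather than predicted by a VLM.

```python
import math

def render(image, exposure, gains):
    """Toy retouch: out[c] = pixel[c] * 2**exposure * gains[c].

    `image` is a list of (r, g, b) tuples. No clipping is applied, so the
    output is a smooth function of the (hypothetical) control parameters.
    """
    scale = 2.0 ** exposure
    return [tuple(p[c] * scale * gains[c] for c in range(3)) for p in image]

def loss_and_grads(image, target, exposure, gains):
    """Mean squared pixel error and its analytic gradients."""
    scale = 2.0 ** exposure
    n = 3 * len(image)
    loss, d_exposure, d_gains = 0.0, 0.0, [0.0, 0.0, 0.0]
    for p, t in zip(image, target):
        for c in range(3):
            out = p[c] * scale * gains[c]
            err = out - t[c]
            loss += err * err / n
            # d(out)/d(exposure) = out * ln 2, d(out)/d(gains[c]) = p[c] * scale
            d_exposure += 2.0 * err * out * math.log(2.0) / n
            d_gains[c] += 2.0 * err * p[c] * scale / n
    return loss, d_exposure, d_gains

def fit(image, target, steps=2000, lr=0.5):
    """Plain gradient descent on the pixel loss, starting from the identity edit."""
    exposure, gains = 0.0, [1.0, 1.0, 1.0]
    for _ in range(steps):
        _, d_e, d_g = loss_and_grads(image, target, exposure, gains)
        exposure -= lr * d_e
        gains = [g - lr * dg for g, dg in zip(gains, d_g)]
    return exposure, gains

if __name__ == "__main__":
    image = [(0.2, 0.4, 0.6), (0.5, 0.5, 0.5), (0.8, 0.3, 0.1)]
    target = render(image, 0.5, [1.1, 0.9, 1.0])  # synthetic "retouched" target
    exposure, gains = fit(image, target)
    final_loss, _, _ = loss_and_grads(image, target, exposure, gains)
    print("final pixel loss:", final_loss)
```

Because the pixel error flows back to the control parameters through the renderer, no external editing tool sits in the optimization loop. In this toy the exposure and gains are not separately identifiable (only their product per channel matters), so the fitted values need not equal the ones used to create the target even though the rendered pixels match.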