VeraRetouch: A Lightweight Fully Differentiable Framework for Multi-Task Reasoning Photo Retouching

📅 2026-04-29
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the limitations of existing photo retouching methods, which rely on non-differentiable external tools, leading to optimization difficulties, parameter redundancy, and poor generalization. To overcome these issues, we propose a lightweight, end-to-end differentiable retouching framework that leverages a 0.5B-parameter vision-language model to interpret both image defects and semantic editing instructions. A fully differentiable Retouch Renderer enables pixel-level training, while decoupled control latent variables and inverse degradation-based data synthesis enhance model generalization. Our contributions include AetherRetouch-1M+, the first million-scale professional retouching dataset, the differentiable renderer itself, and DAPO-AE, a reinforcement learning-based post-training strategy. The proposed method achieves state-of-the-art performance across multiple benchmarks, with significantly reduced model size, enabling efficient multi-task inference and mobile deployment.
๐Ÿ“ Abstract
Reasoning photo retouching has gained significant traction, requiring models to analyze image defects, explain their reasoning, and execute precise retouching enhancements. However, existing approaches often rely on non-differentiable external software, which creates optimization barriers, and they suffer from high parameter redundancy and limited generalization. To address these challenges, we propose VeraRetouch, a lightweight and fully differentiable framework for multi-task photo retouching. We employ a 0.5B Vision-Language Model (VLM) as the central intelligence to formulate retouching plans based on instructions and scene semantics. Furthermore, we develop a fully differentiable Retouch Renderer that replaces external tools, enabling direct end-to-end pixel-level training through decoupled control latents for lighting, global color, and specific color adjustments. To overcome data scarcity, we introduce AetherRetouch-1M+, the first million-scale dataset for professional retouching, constructed via a new inverse degradation workflow. In addition, we propose DAPO-AE, a reinforcement learning post-training strategy that enhances autonomous aesthetic cognition. Extensive experiments demonstrate that VeraRetouch achieves state-of-the-art performance across multiple benchmarks while maintaining a significantly smaller footprint, enabling mobile deployment. Our code and models are publicly available at https://github.com/OpenVeraTeam/VeraRetouch.
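The abstract credits a differentiable renderer with making end-to-end pixel-level training possible. As a rough illustration of that general idea only, the sketch below uses a toy renderer with two hypothetical controls, a `gamma` value standing in for lighting and per-channel `gains` standing in for global color, and recovers them from pixel error by gradient descent. The parameterization, function names, and sample pixels are all assumptions made for this example; none of it is taken from the VeraRetouch code.

```python
import math

# Toy stand-in for a differentiable renderer; NOT the paper's Retouch Renderer.
# Hypothetical controls: `gamma` (lighting) and `gains` (per-channel color).

def render(pixels, gamma, gains):
    """Return gains[c] * p[c] ** gamma for every RGB pixel (values in (0, 1])."""
    return [[gains[c] * p[c] ** gamma for c in range(3)] for p in pixels]

def loss_and_grads(pixels, target, gamma, gains):
    """Mean squared pixel error plus analytic gradients w.r.t. gamma and gains."""
    n = 3 * len(pixels)
    loss, g_gamma, g_gains = 0.0, 0.0, [0.0, 0.0, 0.0]
    for p, t in zip(pixels, target):
        for c in range(3):
            base = p[c] ** gamma
            err = gains[c] * base - t[c]
            loss += err * err / n
            # d(out)/d(gains[c]) = base; d(out)/d(gamma) = gains[c] * base * ln(p)
            g_gains[c] += 2.0 * err * base / n
            g_gamma += 2.0 * err * gains[c] * base * math.log(p[c]) / n
    return loss, g_gamma, g_gains

def fit(pixels, target, steps=1000, lr=0.5):
    """Plain gradient descent on the controls, starting from the identity edit."""
    gamma, gains = 1.0, [1.0, 1.0, 1.0]
    for _ in range(steps):
        _, g_gamma, g_gains = loss_and_grads(pixels, target, gamma, gains)
        gamma -= lr * g_gamma
        gains = [g - lr * dg for g, dg in zip(gains, g_gains)]
    return gamma, gains

# Made-up sample pixels and a synthetic "retouched" reference built from known controls.
pixels = [[0.2, 0.4, 0.6], [0.8, 0.5, 0.3], [0.1, 0.9, 0.7], [0.5, 0.3, 0.9]]
target = render(pixels, 0.8, [1.1, 1.0, 0.9])
gamma, gains = fit(pixels, target)
final_loss, _, _ = loss_and_grads(pixels, target, gamma, gains)
```

Because every step from the controls to the output pixels has a gradient, the pixel error can update the controls directly. That is the property a non-differentiable external tool lacks. In a full system a model would predict the controls and gradients would flow back into its weights, which this sketch does not attempt.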
Problem

Research questions and friction points this paper is trying to address.

photo retouching
differentiable framework
multi-task reasoning
optimization barriers
generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

fully differentiable rendering
vision-language model
multi-task photo retouching
inverse degradation
reinforcement learning post-training