🤖 AI Summary
To address the challenge novice users face in planning multi-step, professional-grade photo retouching operations, this paper proposes a multimodal large language model (MLLM) framework tailored for programmatic image editing. Methodologically, we introduce visual puzzle pretraining to enhance the MLLM’s understanding of low-level image processing semantics; design a traceable, user-controllable, pixel-faithful editing reasoning paradigm; and integrate programmatic operation modeling, expert-edit-guided reverse synthesis for training data construction, and multi-stage vision-language grounding fine-tuning. Compared with both generative and conventional programmatic approaches, our method significantly improves edit interpretability, identity preservation, and fine-grained detail fidelity, achieving state-of-the-art performance across multiple benchmarks. Our core contribution is the first end-to-end MLLM framework capable of planning and reasoning over executable, verifiable, and identity-preserving programmatic retouching operations.
📝 Abstract
Retouching is an essential task in the post-processing of raw photographs. Generative editing, guided by text or strokes, provides a new tool accessible to novice users but can easily change the identity of the original objects in unacceptable and unpredictable ways. In contrast, traditional procedural edits, as commonly supported by photo-editing tools (e.g., GIMP, Lightroom), are conservative but still preferred by professionals. Unfortunately, professional-quality retouching involves many individual procedural editing operations that are challenging for most novices to plan. In this paper, we ask if a multimodal large language model (MLLM) can be taught to critique raw photographs, suggest suitable remedies, and finally realize them with a given set of pre-authored procedural image operations. We demonstrate that MLLMs can first be made aware of the underlying image processing operations by training them to solve specially designed visual puzzles. Subsequently, such an operation-aware MLLM can both plan and propose edit sequences. To facilitate training, given a set of expert-edited photos, we synthesize a reasoning dataset by procedurally manipulating the expert edits and then grounding a pretrained LLM on the visual adjustments to generate reasoning traces for finetuning. The proposed retouching operations are, by construction, understandable by users, preserve object details and resolution, and can optionally be overridden. We evaluate our setup on a variety of test examples and show advantages, in terms of explainability and identity preservation, over existing generative and other procedural alternatives. Code, data, models, and supplementary results can be found via our project website at https://monetgpt.github.io.
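To make the idea of pre-authored procedural operations concrete, here is a minimal, hypothetical sketch (not the paper's actual operation set or API) of an edit plan as an MLLM might emit it: an ordered list of named operations with scalar parameters, each a deterministic pixel transform that a user can inspect, re-parameterize, or skip before execution. The operation names and Pillow-based implementations below are illustrative stand-ins.

```python
# Illustrative sketch only: a tiny "pre-authored" operation library and a
# plan executor. The real system's operations and parameterization differ.
from PIL import Image, ImageEnhance

# Each operation is a named, deterministic transform with one scalar knob.
OPS = {
    "exposure":   lambda img, v: ImageEnhance.Brightness(img).enhance(v),
    "contrast":   lambda img, v: ImageEnhance.Contrast(img).enhance(v),
    "saturation": lambda img, v: ImageEnhance.Color(img).enhance(v),
    "sharpness":  lambda img, v: ImageEnhance.Sharpness(img).enhance(v),
}

def apply_plan(img, plan):
    """Execute an ordered, human-readable edit plan.

    Because every step is an explicit (name, value) pair, the plan is
    interpretable and any step can be overridden before execution.
    """
    for op, value in plan:
        img = OPS[op](img, value)
    return img

# A plan an operation-aware planner might propose for a dull photograph.
plan = [("exposure", 1.15), ("contrast", 1.10), ("saturation", 1.05)]
img = Image.new("RGB", (64, 64), (100, 100, 100))
out = apply_plan(img, plan)
print(out.size)
```

Note that, unlike generative editing, such a pipeline preserves resolution and object identity by construction: every transform is a global, parametric adjustment of the original pixels.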