JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent

📅 2025-06-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current AI-powered photo editing tools suffer from limited adjustability and poor generalization, so they fail to deliver professional-grade control and accurate interpretation of user intent at the same time. This paper introduces JarvisArt, a multimodal large language model (MLLM)-driven agent for professional photo retouching. It is trained in two stages: Chain-of-Thought supervised fine-tuning, followed by Group Relative Policy Optimization for Retouching (GRPO-R). The agent mimics the reasoning of professional artists to coordinate over 200 retouching tools within Lightroom for fine-grained global and local edits. The paper also proposes an Agent-to-Lightroom Protocol for integration with Lightroom. On MMArt-Bench, a benchmark built from real-world user edits, JarvisArt outperforms GPT-4o by 60% on average pixel-level metrics for content fidelity while maintaining comparable instruction-following capability.

📝 Abstract
Photo retouching has become integral to contemporary visual storytelling, enabling users to capture aesthetics and express creativity. While professional tools such as Adobe Lightroom offer powerful capabilities, they demand substantial expertise and manual effort. In contrast, existing AI-based solutions provide automation but often suffer from limited adjustability and poor generalization, failing to meet diverse and personalized editing needs. To bridge this gap, we introduce JarvisArt, a multi-modal large language model (MLLM)-driven agent that understands user intent, mimics the reasoning process of professional artists, and intelligently coordinates over 200 retouching tools within Lightroom. JarvisArt undergoes a two-stage training process: an initial Chain-of-Thought supervised fine-tuning to establish basic reasoning and tool-use skills, followed by Group Relative Policy Optimization for Retouching (GRPO-R) to further enhance its decision-making and tool proficiency. We also propose the Agent-to-Lightroom Protocol to facilitate seamless integration with Lightroom. To evaluate performance, we develop MMArt-Bench, a novel benchmark constructed from real-world user edits. JarvisArt demonstrates user-friendly interaction, superior generalization, and fine-grained control over both global and local adjustments, paving a new avenue for intelligent photo retouching. Notably, it outperforms GPT-4o with a 60% improvement in average pixel-level metrics on MMArt-Bench for content fidelity, while maintaining comparable instruction-following capabilities. Project Page: https://jarvisart.vercel.app/.
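The abstract names Group Relative Policy Optimization as the basis of the second training stage but does not spell out the mechanics. As a minimal sketch of the general idea only: GRPO samples several responses for the same prompt, scores each one, and normalizes every reward against the mean and standard deviation of its own group. The reward values and the function name below are hypothetical, and the retouching-specific reward design of GRPO-R is not described in this summary, so it is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar score per sampled response to the same prompt.
    `eps` avoids division by zero when every response scores the same.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scores for four retouching plans sampled for one instruction.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive a positive advantage and are reinforced; those below the mean are discouraged. Because the baseline comes from the group itself, no separate value network is required.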
Problem

Research questions and friction points this paper is trying to address.

Bridges gap between professional tools and AI automation
Enhances adjustability and generalization in photo retouching
Meets diverse and personalized editing needs intelligently
Innovation

Methods, ideas, or system contributions that make the work stand out.

MLLM-driven agent for photo retouching
Two-stage training with GRPO-R optimization
Agent-to-Lightroom Protocol for seamless integration
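This summary does not describe the message format of the Agent-to-Lightroom Protocol. Purely as an illustration of what an agent-to-editor hand-off could look like, the sketch below validates a set of proposed adjustments, clamps them to allowed ranges, and serializes them as JSON. Every name here (`PARAM_RANGES`, `build_edit_message`, the parameter names, the ranges, and the `scope`/`mask` fields) is a hypothetical assumption, not the paper's protocol and not a Lightroom API.

```python
import json

# Hypothetical parameter ranges, chosen for illustration only.
PARAM_RANGES = {
    "exposure": (-5.0, 5.0),
    "contrast": (-100.0, 100.0),
    "saturation": (-100.0, 100.0),
}


def build_edit_message(adjustments, mask=None):
    """Clamp proposed adjustments and serialize them as a JSON message.

    `adjustments` maps parameter names to numeric values.
    `mask` is an optional region label; when given, the edit is marked local.
    """
    clamped = {}
    for name, value in adjustments.items():
        if name not in PARAM_RANGES:
            raise ValueError(f"unknown parameter: {name}")
        low, high = PARAM_RANGES[name]
        clamped[name] = max(low, min(high, float(value)))
    message = {
        "scope": "local" if mask else "global",
        "mask": mask,
        "adjustments": clamped,
    }
    return json.dumps(message, sort_keys=True)


# A global edit whose exposure value is out of range and gets clamped.
print(build_edit_message({"exposure": 7, "contrast": 20}))
```

Validating on the agent side keeps out-of-range values produced by the model from reaching the editor, which is one plausible reason to place a structured protocol between the two.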
Authors

Yunlong Lin
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Xiamen, Fujian, China

Zixu Lin
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Xiamen, Fujian, China

Kunjie Lin
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Xiamen, Fujian, China

Jinbin Bai
National University of Singapore
Machine Learning, Content Creation, Generative Modeling

Panwang Pan
ByteDance
Multi-modal Learning, Generative AI, CV/ML

Chenxin Li
The Chinese University of Hong Kong
Multimodal LLM, Agent, World Model

Haoyu Chen
The Hong Kong University of Science and Technology (Guangzhou)

Zhongdao Wang
Noah's Ark Lab, Huawei
Computer Vision, Autonomous Driving

Xinghao Ding
Unknown affiliation

Wenbo Li
The Chinese University of Hong Kong
Computer Vision, Deep Learning

Shuicheng Yan
National University of Singapore