JarvisEvo: Towards a Self-Evolving Photo Editing Agent with Synergistic Editor-Evaluator Optimization

📅 2025-11-28
🤖 AI Summary
This paper addresses two limitations of existing agent-based image editing models: instruction hallucination, which arises from unimodal, text-only reasoning, and reward hacking, which arises from optimizing against static, externally defined reward functions. To overcome these, we propose JarvisEvo, a self-evolving multimodal image editing agent. Methodologically, we introduce an interleaved multimodal chain-of-thought (iMCoT) mechanism that combines vision-language reasoning with Adobe Lightroom tool invocation, and a synergistic editor-evaluator policy optimization (SEPO) framework that lets the agent reflect on and iteratively refine its edits without external rewards. On ArtEdit-Bench, JarvisEvo outperforms Nano-Banana by an average of 18.95% on preservative editing metrics, including a 44.96% improvement in pixel-level content fidelity, and it supports both global and fine-grained local editing.

📝 Abstract
Agent-based editing models have substantially advanced interactive experiences, processing quality, and creative flexibility. However, two critical challenges persist: (1) instruction hallucination: text-only chain-of-thought (CoT) reasoning cannot fully prevent factual errors due to inherent information bottlenecks; (2) reward hacking: dynamic policy optimization against static reward models allows agents to exploit flaws in reward functions. To address these issues, we propose JarvisEvo, a unified image editing agent that emulates an expert human designer by iteratively editing, selecting appropriate tools, evaluating results, and reflecting on its own decisions to refine outcomes. JarvisEvo offers three key advantages: (1) an interleaved multimodal chain-of-thought (iMCoT) reasoning mechanism that enhances instruction following and editing quality; (2) a synergistic editor-evaluator policy optimization (SEPO) framework that enables self-improvement without external rewards, effectively mitigating reward hacking; and (3) support for both global and local fine-grained editing through seamless integration of Adobe Lightroom. On ArtEdit-Bench, JarvisEvo outperforms Nano-Banana by an average of 18.95% on preservative editing metrics, including a substantial 44.96% improvement in pixel-level content fidelity.
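
The edit, evaluate, and reflect cycle described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation and does not model iMCoT, SEPO training, or any Adobe Lightroom API. Everything here is a hypothetical stand-in: the "image" is a single integer exposure level, `TOOLS` holds two made-up adjustments, and `evaluate` scores against a known target that a real agent would not have.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for an image: one exposure level in [0, 100].
Image = int

# Hypothetical tool set; real Lightroom controls are not modeled here.
TOOLS: Dict[str, Callable[[Image], Image]] = {
    "brighten": lambda img: min(100, img + 10),
    "darken": lambda img: max(0, img - 10),
}


def evaluate(img: Image, target: Image) -> int:
    """Toy evaluator: higher is better, 0 means a perfect match."""
    return -abs(img - target)


def edit_loop(img: Image, target: Image, max_steps: int = 10) -> Tuple[Image, List[str]]:
    """Repeatedly edit, score the result, and keep an edit only if it helps."""
    history: List[str] = []
    score = evaluate(img, target)
    for _ in range(max_steps):
        # Editor step: apply every available tool to the current image.
        candidates = [(name, tool(img)) for name, tool in TOOLS.items()]
        # Evaluator step: pick the candidate with the best score.
        name, best = max(candidates, key=lambda c: evaluate(c[1], target))
        new_score = evaluate(best, target)
        # Reflection step: stop if the best available edit does not improve.
        if new_score <= score:
            break
        img, score = best, new_score
        history.append(name)
    return img, history


final_image, applied_tools = edit_loop(20, 50)
print(final_image, applied_tools)
```

In this sketch the evaluator is fixed, so it only shows the control flow of the loop. The paper's stated contribution is that the editor and evaluator are optimized together, which this example does not attempt to reproduce.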
Problem

Research questions and friction points this paper is trying to address.

Addresses instruction hallucination in image editing agents
Mitigates reward hacking through synergistic editor-evaluator optimization
Enhances editing quality with multimodal reasoning and self-reflection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal chain-of-thought reasoning for accurate instruction following
Synergistic editor-evaluator optimization to prevent reward hacking
Seamless Adobe Lightroom integration for global and local editing
Authors
Yunlong Lin (Tencent Hunyuan)
Linqing Wang (Tencent Hunyuan)
Kunjie Lin (Tencent Hunyuan)
Zixu Lin (Tencent Hunyuan)
Kaixiong Gong (MMLab, CUHK): Multimodal Learning, Generation
Wenbo Li (The Chinese University of Hong Kong): Computer Vision, Deep Learning
Bin Lin (Tencent Hunyuan)
Zhenxi Li (Tencent Hunyuan)
Shiyi Zhang (Tsinghua University): Video Generation, Video Understanding
Yuyang Peng (Tencent Hunyuan)
Wenxun Dai (Tencent Hunyuan)
Xinghao Ding (Unknown affiliation)
Chunyu Wang (Tencent Hunyuan)
Qinglin Lu (Tencent Hunyuan)