AI Summary
This work addresses the limitations of existing conditional image editing methods, which typically rely on single-step generation and lack explicit quality control. As a result, they often produce structural distortions, contextual inconsistencies, and excessive deviation from the original image. To overcome these issues, the work proposes CAMEO, a quality-aware, feedback-driven multi-agent collaborative editing framework that, for the first time, integrates quality assessment and iterative feedback into the editing process. CAMEO enables closed-loop optimization through coordinated planning, structured prompting, hypothesis generation, and adaptive reference fusion. The framework is compatible with mainstream editing backbone models and outperforms multiple state-of-the-art approaches, with an average win-rate gain of 20% on anomaly insertion and human pose transfer, improving controllability, robustness, and structural consistency in image editing.
Abstract
Conditional image editing aims to modify a source image according to textual prompts and optional reference guidance. Such editing is crucial in scenarios requiring strict structural control (e.g., anomaly insertion in driving scenes and complex human pose transformation). Despite recent advances in large-scale editing models (e.g., Seedream and Nano Banana), most approaches rely on single-step generation. This paradigm lacks explicit quality control, may introduce excessive deviation from the original image, and frequently produces structural artifacts or environment-inconsistent modifications, so acceptable results typically require manual prompt tuning. We propose CAMEO, a structured multi-agent framework that reformulates conditional editing as a quality-aware, feedback-driven process rather than a one-shot generation task. CAMEO decomposes editing into coordinated stages of planning, structured prompting, hypothesis generation, and adaptive reference grounding, where external guidance is invoked only when task complexity requires it. To overcome the lack of intrinsic quality control in existing methods, evaluation is embedded directly within the editing loop. Intermediate results are iteratively refined through structured feedback, forming a closed-loop process that progressively corrects structural and contextual inconsistencies. We evaluate CAMEO on anomaly insertion and human pose switching tasks. Across multiple strong editing backbones and independent evaluation models, CAMEO consistently achieves an average win-rate gain of 20% over multiple state-of-the-art models, demonstrating improved robustness, controllability, and structural reliability in conditional image editing.
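The closed-loop process described in the abstract can be illustrated with a minimal Python sketch. This is not the CAMEO implementation. Every name here (`closed_loop_edit`, `Feedback`, `EditResult`, the `toy_*` stand-ins, the `threshold` and `max_rounds` parameters) is hypothetical, images are represented as plain strings, and the rule "use the reference only after a round fails the quality check" is an assumed proxy for the adaptive reference grounding the abstract mentions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Feedback:
    score: float        # assumed quality score in [0, 1]; higher is better
    issues: List[str]   # structured notes on what the next round should fix


@dataclass
class EditResult:
    image: str          # stand-in for an image handle
    score: float
    rounds: int
    history: List[float] = field(default_factory=list)


def closed_loop_edit(
    source: str,
    instruction: str,
    plan: Callable[[str, str], List[str]],
    build_prompt: Callable[[List[str], List[str]], str],
    generate: Callable[[str, str, Optional[str]], List[str]],
    evaluate: Callable[[str, str, str], Feedback],
    reference: Optional[str] = None,
    threshold: float = 0.8,
    max_rounds: int = 4,
) -> EditResult:
    """Plan once, then generate, evaluate, and refine until the score passes."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    steps = plan(source, instruction)
    issues: List[str] = []
    use_reference = False  # assumption: escalate to the reference only on failure
    best_image, best_score = source, float("-inf")
    history: List[float] = []
    for round_idx in range(1, max_rounds + 1):
        prompt = build_prompt(steps, issues)
        # generate() is assumed to return at least one candidate (hypothesis).
        candidates = generate(source, prompt, reference if use_reference else None)
        scored = [(evaluate(source, c, instruction), c) for c in candidates]
        feedback, candidate = max(scored, key=lambda pair: pair[0].score)
        history.append(feedback.score)
        if feedback.score > best_score:
            best_image, best_score = candidate, feedback.score
        if feedback.score >= threshold:
            break
        issues = feedback.issues               # feed the critique into the next prompt
        use_reference = reference is not None
    return EditResult(best_image, best_score, round_idx, history)


# Toy stand-ins so the loop can be run without any model.
def toy_plan(source: str, instruction: str) -> List[str]:
    return [instruction]


def toy_build_prompt(steps: List[str], issues: List[str]) -> str:
    return " | ".join(steps + [f"fix: {issue}" for issue in issues])


def toy_generate(source: str, prompt: str, reference: Optional[str]) -> List[str]:
    return [f"{source} <- {prompt}" + ("+ref" if reference else "")]


def toy_evaluate(source: str, candidate: str, instruction: str) -> Feedback:
    bonus = 0.25 if candidate.endswith("+ref") else 0.0
    score = min(0.5 + 0.25 * candidate.count("fix:") + bonus, 1.0)
    return Feedback(score, [] if score >= 0.8 else ["structure drift"])
```

With the toy stand-ins, calling `closed_loop_edit("img", "add cone", toy_plan, toy_build_prompt, toy_generate, toy_evaluate, reference="ref.png")` stops after the second round, once feedback and the reference have both been applied. Without a reference, the toy score stays below the threshold and the loop runs until `max_rounds`, returning the best candidate seen.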