Meta-CoT: Enhancing Granularity and Generalization in Image Editing

📅 2026-04-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of achieving fine-grained understanding and cross-task generalization at the same time in image editing. The authors propose Meta-CoT, a paradigm that decomposes each single-image editing operation at two levels. First, it represents the editing intention as a triplet of (task, target, required understanding ability) and generates task-specific CoT that traverses every target. Second, it breaks editing tasks down into five fundamental meta-tasks. A CoT-Editing Consistency Reward is introduced to align the model's editing behavior with its CoT reasoning. Experiments show an overall 15.8% improvement across 21 editing tasks, and the model generalizes to unseen editing tasks when trained on only a small set of meta-tasks.

📝 Abstract

Unified multi-modal understanding/generative models have shown improved image editing performance by incorporating fine-grained understanding into their Chain-of-Thought (CoT) process. However, a critical question remains underexplored: what forms of CoT and training strategy can jointly enhance both the understanding granularity and generalization? To address this, we propose Meta-CoT, a paradigm that performs a two-level decomposition of any single-image editing operation with two key properties: (1) Decomposability. We observe that any editing intention can be represented as a triplet: (task, target, required understanding ability). Inspired by this, Meta-CoT decomposes both the editing task and the target, generating task-specific CoT and traversing editing operations on all targets. This decomposition enhances the model's understanding granularity of editing operations and guides it to learn each element of the triplet during training, substantially improving the editing capability. (2) Generalizability. In the second decomposition level, we further break down editing tasks into five fundamental meta-tasks. We find that training on these five meta-tasks, together with the other two elements of the triplet, is sufficient to achieve strong generalization across diverse, unseen editing tasks. To further align the model's editing behavior with its CoT reasoning, we introduce the CoT-Editing Consistency Reward, which encourages more accurate and effective utilization of CoT information during editing. Experiments demonstrate that our method achieves an overall 15.8% improvement across 21 editing tasks, and generalizes effectively to unseen editing tasks when trained on only a small set of meta-tasks. Our code, benchmark, and model are released at https://shiyi-zh0408.github.io/projectpages/Meta-CoT/
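
The triplet view of an editing intention described in the abstract can be pictured with a small data structure. The following is only an illustrative sketch, not the authors' implementation: the class `EditTriplet`, the helpers `decompose` and `render_cot`, and the example labels ("remove", "spatial localization") are all hypothetical names chosen here, and the five meta-tasks are not modeled because the abstract does not name them.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EditTriplet:
    """Hypothetical container for one (task, target, ability) triplet."""
    task: str      # editing task label (illustrative, not from the paper)
    target: str    # object or region the task applies to
    ability: str   # understanding ability the edit requires (illustrative)


def decompose(task: str, targets: List[str], ability: str) -> List[EditTriplet]:
    """Expand one editing intention into one triplet per target."""
    return [EditTriplet(task, target, ability) for target in targets]


def render_cot(triplets: List[EditTriplet]) -> str:
    """Render a plain-text reasoning trace that visits every target in order."""
    return "\n".join(
        f"Step {i}: {t.task} '{t.target}' (requires {t.ability})"
        for i, t in enumerate(triplets, start=1)
    )


# Assumed example instruction: "remove the red car and the street sign".
triplets = decompose("remove", ["red car", "street sign"], "spatial localization")
print(render_cot(triplets))
```

Keeping one triplet per target mirrors the idea of traversing editing operations on all targets; how the actual model produces or consumes such a trace is not specified in the abstract.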

Problem

Research questions and friction points this paper is trying to address.

Chain-of-Thought

image editing

granularity

generalization

multi-modal understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought

meta-task decomposition

image editing