AI Summary
Existing evaluation metrics for text-guided image editing suffer from a "contextual blind spot," failing to dynamically balance fidelity to the source image and edit compliance with the target text. While directional CLIP similarity incorporates contextual information, it is inherently biased toward edit magnitude and vulnerable to interference from irrelevant image regions. This paper proposes AugCLIP, a context-aware evaluation metric that introduces a novel multimodal large language model (MLLM)-based text augmentation method. AugCLIP constructs an attribute-separating hyperplane in the CLIP embedding space to adaptively coordinate preservation and modification weights. It jointly models semantic consistency, structural fidelity, and editing accuracy within a unified framework. Evaluated on five standard benchmarks, AugCLIP significantly outperforms existing metrics and achieves state-of-the-art correlation with human judgments. The implementation is publicly available.
Abstract
The development of vision-language and generative models has significantly advanced text-guided image editing, which seeks to preserve the core elements of the source image while implementing modifications based on the target text. However, existing metrics have a context-blindness problem: they indiscriminately apply the same evaluation criteria to completely different pairs of source image and target text, biasing towards either modification or preservation. Directional CLIP similarity, the only metric that considers both the source image and the target text, is also biased towards modification aspects and attends to irrelevant editing regions of the image. We propose AugCLIP, a context-aware metric that adaptively coordinates preservation and modification aspects depending on the specific context of a given source image and target text. This is done by deriving the CLIP representation of an ideally edited image that preserves the source image while making the modifications necessary to align with the target text. More specifically, using a multi-modal large language model, AugCLIP augments the textual descriptions of the source and target, then calculates a modification vector through a hyperplane that separates source and target attributes in CLIP space. Extensive experiments on five benchmark datasets, encompassing a diverse range of editing scenarios, show that AugCLIP aligns remarkably well with human evaluation standards, outperforming existing metrics. The code is available at https://github.com/augclip/augclip_eval.
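To make the pipeline described above concrete, the following is a minimal, hypothetical sketch of the core idea: embed MLLM-generated attribute descriptions of the source and target with a CLIP text encoder, fit a hyperplane that separates the two attribute sets, and shift the source-image embedding along the hyperplane normal to estimate an "ideally edited" representation against which an edited image is scored. It assumes Hugging Face `transformers` CLIP and scikit-learn; the attribute lists, step size, and scoring rule are simplified stand-ins, and the authors' actual algorithm is in the released repository.

```python
# Hypothetical sketch of the AugCLIP idea described in the abstract (not the authors' exact code).
import numpy as np
import torch
from PIL import Image
from sklearn.svm import LinearSVC
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_texts(texts):
    # CLIP text embeddings, L2-normalized.
    inputs = processor(text=texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1).numpy()

def embed_image(image):
    # CLIP image embedding, L2-normalized.
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1).numpy()[0]

def augclip_like_score(source_img, edited_img, source_attrs, target_attrs, step=1.0):
    """Toy score: similarity between the edited image and an estimated 'ideal edit',
    i.e. the source embedding shifted along the source/target separating hyperplane normal."""
    src_txt = embed_texts(source_attrs)   # MLLM-augmented descriptions of the source
    tgt_txt = embed_texts(target_attrs)   # MLLM-augmented descriptions of the target edit
    X = np.concatenate([src_txt, tgt_txt])
    y = np.array([0] * len(src_txt) + [1] * len(tgt_txt))
    clf = LinearSVC(C=1.0).fit(X, y)      # hyperplane separating source/target attributes in CLIP space
    direction = clf.coef_[0] / np.linalg.norm(clf.coef_[0])  # modification vector

    src_emb = embed_image(source_img)
    ideal = src_emb + step * direction    # estimated representation of the ideally edited image
    ideal /= np.linalg.norm(ideal)
    return float(embed_image(edited_img) @ ideal)  # higher = closer to the ideal edit

# Usage with hypothetical attribute lists an MLLM might produce:
# score = augclip_like_score(
#     Image.open("source.png"), Image.open("edited.png"),
#     source_attrs=["a brown dog sitting on grass", "short fur", "daytime lighting"],
#     target_attrs=["a brown dog wearing a red hat", "short fur", "daytime lighting"],
# )
```

The step size and the choice of a linear SVM here are illustrative; the point is that shared attributes appear on both sides of the hyperplane and therefore contribute little to the modification vector, which is how the metric adapts preservation versus modification to the given source image and target text.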