CAMILA: Context-Aware Masking for Image Editing with Language Alignment

📅 2025-09-23

📈 Citations: 0

✨ Influential: 0

career value

180K/year

🤖 AI Summary

Existing text-guided image editing methods often blindly execute infeasible or contradictory instructions, leading to semantic distortions. To address this, we propose a context-aware instruction filtering and editing localization framework: (1) a semantic-matching context verification module dynamically assesses instruction feasibility; (2) an attention-guided region masking mechanism precisely identifies editable regions; and (3) end-to-end optimization incorporates a text–image alignment loss. To rigorously evaluate handling of infeasible instructions, we introduce the first benchmark dataset featuring both single- and multi-step infeasible requests. Experiments demonstrate that our method significantly outperforms state-of-the-art approaches in semantic consistency (+12.7%) and image fidelity (+9.3% PSNR), with exceptional robustness under complex, conflicting instruction scenarios.

Technology Category

Application Category

📝 Abstract

Text-guided image editing has been allowing users to transform and synthesize images through natural language instructions, offering considerable flexibility. However, most existing image editing models naively attempt to follow all user instructions, even if those instructions are inherently infeasible or contradictory, often resulting in nonsensical output. To address these challenges, we propose a context-aware method for image editing named as CAMILA (Context-Aware Masking for Image Editing with Language Alignment). CAMILA is designed to validate the contextual coherence between instructions and the image, ensuring that only relevant edits are applied to the designated regions while ignoring non-executable instructions. For comprehensive evaluation of this new method, we constructed datasets for both single- and multi-instruction image editing, incorporating the presence of infeasible requests. Our method achieves better performance and higher semantic alignment than state-of-the-art models, demonstrating its effectiveness in handling complex instruction challenges while preserving image integrity.

Problem

Research questions and friction points this paper is trying to address.

Validating contextual coherence between instructions and images for editing

Ignoring non-executable or contradictory user instructions during image editing

Preventing nonsensical outputs from infeasible text-guided image editing requests

Innovation

Methods, ideas, or system contributions that make the work stand out.

Context-aware masking validates instruction-image coherence

Selectively applies feasible edits to designated regions

Achieves higher semantic alignment than state-of-the-art models

🔎 Similar Papers

Cropper: Vision-Language Model for Image Cropping through In-Context Learning