FlexEdit: Marrying Free-Shape Masks to VLLM for Flexible Image Editing

📅 2024-08-22
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
To address the challenge of imprecise spatial localization of image editing regions from natural language instructions, this paper proposes an end-to-end image editing method integrating free-shape masks with Vision Large Language Models (VLLMs). The method jointly leverages diffusion models and lightweight prompting strategies to enhance fine-grained regional editing accuracy while preserving controllability. Its core contributions are: (1) a Mask Enhance Adapter (MEA) that enables deep alignment between VLLM multimodal embeddings and arbitrary-shaped masks; and (2) FSMI-Edit, a benchmark dedicated to free-shape mask-guided editing, comprising eight distinct mask morphologies. Extensive experiments demonstrate state-of-the-art performance on LLM-driven image editing tasks, validating the effectiveness and generalizability of jointly modeling cross-modal language–image semantics and mask-guided spatial priors.
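Neither the summary nor the abstract details how the MEA performs this fusion internally. Purely as a hypothetical illustration of mask-conditioned embedding fusion, the PyTorch sketch below cross-attends VLLM tokens to patch embeddings of a free-shape mask; the class name, shapes, and design choices are assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class MaskEnhanceAdapterSketch(nn.Module):
    """Hypothetical sketch of a Mask-Enhance-Adapter-style module:
    fuses VLLM output embeddings with features of a free-shape mask
    via cross-attention. All names and shapes are illustrative only."""

    def __init__(self, dim: int = 768, num_heads: int = 8, patch: int = 16):
        super().__init__()
        # Embed the binary mask as non-overlapping patches (illustrative choice).
        self.mask_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        # VLLM tokens attend to mask tokens, injecting spatial priors.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vllm_tokens: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # vllm_tokens: (B, N, dim) multimodal embeddings from the VLLM
        # mask:        (B, 1, H, W) free-shape binary mask
        m = self.mask_embed(mask)              # (B, dim, H/p, W/p)
        m = m.flatten(2).transpose(1, 2)       # (B, M, dim) mask tokens
        fused, _ = self.cross_attn(query=vllm_tokens, key=m, value=m)
        return self.norm(vllm_tokens + fused)  # residual fusion

# Smoke test with illustrative shapes.
adapter = MaskEnhanceAdapterSketch()
tokens = torch.randn(2, 77, 768)
mask = torch.zeros(2, 1, 224, 224)
mask[:, :, 64:160, 48:200] = 1.0               # a crude "free-shape" region
print(adapter(tokens, mask).shape)             # torch.Size([2, 77, 768])
```

A residual cross-attention design like this would keep the VLLM embeddings intact while injecting spatial priors only where the mask provides signal.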

📝 Abstract
Combining Vision Large Language Models (VLLMs) with diffusion models offers a powerful method for executing image editing tasks based on human language instructions. However, language instructions alone often fall short in accurately conveying user requirements, particularly when users want to add or replace elements in specific areas of an image. Masks can effectively indicate the exact locations or elements to be edited, but they require users to draw shapes precisely at the desired locations, which is highly user-unfriendly. To address this, we propose FlexEdit, an end-to-end image editing method that leverages both free-shape masks and language instructions for Flexible Editing. Our approach employs a VLLM to comprehend the image content, the mask, and the user instructions. Additionally, we introduce the Mask Enhance Adapter (MEA), which fuses the embeddings of the VLLM with the image data, ensuring a seamless integration of mask information and model output embeddings. Furthermore, we construct FSMI-Edit, a benchmark specifically tailored to free-shape masks, covering 8 types of free-shape mask. Extensive experiments show that our method achieves state-of-the-art (SOTA) performance in LLM-based image editing, and our simple prompting technique stands out in its effectiveness. The code and data can be found at https://github.com/A-new-b/flex_edit.
Problem

Research questions and friction points this paper is trying to address.

Language instructions alone fail to precisely localize the regions users want to edit
Requiring users to draw exact masks at the desired locations is user-unfriendly
How to integrate free-shape masks with VLLMs for flexible, accurate image modifications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines VLLMs with free-shape masks
Uses Mask Enhance Adapter for embedding fusion
Introduces FSMI-Edit benchmark for free-shape mask evaluation (see the sketch below)
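The eight mask morphologies in FSMI-Edit are not enumerated on this page. Purely to illustrate what "free-shape" masks can look like in practice, the NumPy sketch below synthesizes three crude mask types (box, blob, scribble); these are illustrative stand-ins, not FSMI-Edit's actual categories.

```python
import numpy as np

def free_shape_mask(kind: str, size: int = 256, seed: int = 0) -> np.ndarray:
    """Synthesize crude free-shape masks; the kinds here are illustrative,
    not the 8 morphologies defined by FSMI-Edit."""
    rng = np.random.default_rng(seed)
    yy, xx = np.mgrid[0:size, 0:size]
    if kind == "box":                      # loose bounding box
        x0, y0 = rng.integers(0, size // 2, 2)
        w, h = rng.integers(size // 4, size // 2, 2)
        return ((xx >= x0) & (xx < x0 + w) &
                (yy >= y0) & (yy < y0 + h)).astype(np.uint8)
    if kind == "blob":                     # irregular blob around random seed points
        pts = rng.integers(0, size, (8, 2))
        d = np.min([(xx - p[0]) ** 2 + (yy - p[1]) ** 2 for p in pts], axis=0)
        return (d < (size // 6) ** 2).astype(np.uint8)
    if kind == "scribble":                 # thick random polyline
        mask = np.zeros((size, size), np.uint8)
        p = rng.integers(0, size, 2).astype(float)
        for _ in range(200):
            p = np.clip(p + rng.normal(0, 6, 2), 0, size - 1)
            y, x = int(p[1]), int(p[0])
            mask[max(0, y - 4):y + 4, max(0, x - 4):x + 4] = 1
        return mask
    raise ValueError(kind)

for k in ("box", "blob", "scribble"):
    m = free_shape_mask(k)
    print(k, m.shape, m.mean().round(3))   # coverage fraction of each mask
```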
👥 Authors

Jue Wang
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Yuxiang Lin
Shenzhen Technology University
Tianshuo Yuan
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Zhi-Qi Cheng
Assistant Professor @ UW | Graduate Faculty | Ex-CMU, Google, Microsoft | Intel & IBM PhD Fellowship
multimedia processing · multimedia understanding · multimodal foundation models
Xiaolong Wang
Shenzhen Technology University
GH Jiao
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Wei Chen
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Xiaojiang Peng
Shenzhen Technology University
Computer Vision · Facial Expression Recognition · Multimodal Emotion Recognition