FlexEdit: Marrying Free-Shape Masks to VLLM for Flexible Image Editing

📅 2024-08-22
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
To address the challenge of imprecise spatial localization of image editing regions from natural language instructions, this paper proposes an end-to-end image editing method integrating free-shape masks with Vision Large Language Models (VLLMs). The method jointly leverages diffusion models and lightweight prompting strategies to enhance fine-grained regional editing accuracy while preserving controllability. Its core contributions are: (1) a Mask Enhance Adapter (MEA) that enables deep alignment between VLLM multimodal embeddings and arbitrary-shaped masks; and (2) FSMI-Edit, a benchmark dedicated to free-shape mask-guided editing, comprising eight distinct mask morphologies. Extensive experiments demonstrate state-of-the-art performance on LLM-driven image editing tasks, validating the effectiveness and generalizability of jointly modeling cross-modal language–image semantics and mask-guided spatial priors.
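Neither the summary nor the abstract details how the MEA performs this fusion internally. Purely as a hypothetical illustration of mask-conditioned embedding fusion, the PyTorch sketch below cross-attends VLLM tokens to patch embeddings of a free-shape mask; the class name, shapes, and design choices are assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class MaskEnhanceAdapterSketch(nn.Module):
    """Hypothetical sketch of a Mask-Enhance-Adapter-style module:
    fuses VLLM output embeddings with features of a free-shape mask
    via cross-attention. All names and shapes are illustrative only."""

    def __init__(self, dim: int = 768, num_heads: int = 8, patch: int = 16):
        super().__init__()
        # Embed the binary mask as non-overlapping patches (illustrative choice).
        self.mask_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        # VLLM tokens attend to mask tokens, injecting spatial priors.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vllm_tokens: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # vllm_tokens: (B, N, dim) multimodal embeddings from the VLLM
        # mask:        (B, 1, H, W) free-shape binary mask
        m = self.mask_embed(mask)              # (B, dim, H/p, W/p)
        m = m.flatten(2).transpose(1, 2)       # (B, M, dim) mask tokens
        fused, _ = self.cross_attn(query=vllm_tokens, key=m, value=m)
        return self.norm(vllm_tokens + fused)  # residual fusion

# Smoke test with illustrative shapes.
adapter = MaskEnhanceAdapterSketch()
tokens = torch.randn(2, 77, 768)
mask = torch.zeros(2, 1, 224, 224)
mask[:, :, 64:160, 48:200] = 1.0               # a crude "free-shape" region
print(adapter(tokens, mask).shape)             # torch.Size([2, 77, 768])
```

A residual cross-attention design like this would keep the VLLM embeddings intact while injecting spatial priors only where the mask provides signal.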

📝 Abstract
Combining Vision Large Language Models (VLLMs) with diffusion models offers a powerful method for executing image editing tasks based on human language instructions. However, language instructions alone often fall short in accurately conveying user requirements, particularly when users want to add or replace elements in specific areas of an image. Masks can effectively indicate the exact locations or elements to be edited, but they require users to draw shapes precisely at the desired locations, which is highly user-unfriendly. To address this, we propose FlexEdit, an end-to-end image editing method that leverages both free-shape masks and language instructions for Flexible Editing. Our approach employs a VLLM to comprehend the image content, the mask, and the user instructions. Additionally, we introduce the Mask Enhance Adapter (MEA), which fuses the embeddings of the VLLM with the image data, ensuring a seamless integration of mask information and model output embeddings. Furthermore, we construct FSMI-Edit, a benchmark specifically tailored to free-shape masks, covering 8 types of free-shape mask. Extensive experiments show that our method achieves state-of-the-art (SOTA) performance in LLM-based image editing, and our simple prompting technique stands out in its effectiveness. The code and data can be found at https://github.com/A-new-b/flex_edit.
Problem

Research questions and friction points this paper is trying to address.

Language instructions alone fail to precisely localize the regions users want to edit
Requiring users to draw exact masks at the desired locations is user-unfriendly
How to integrate free-shape masks with VLLMs for flexible, accurate image modifications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines VLLMs with free-shape masks
Uses Mask Enhance Adapter for embedding fusion
Introduces FSMI-Edit benchmark for free-shape mask evaluation (see the sketch below)
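The eight mask morphologies in FSMI-Edit are not enumerated on this page. Purely to illustrate what "free-shape" masks can look like in practice, the NumPy sketch below synthesizes three crude mask types (box, blob, scribble); these are illustrative stand-ins, not FSMI-Edit's actual categories.

```python
import numpy as np

def free_shape_mask(kind: str, size: int = 256, seed: int = 0) -> np.ndarray:
    """Synthesize crude free-shape masks; the kinds here are illustrative,
    not the 8 morphologies defined by FSMI-Edit."""
    rng = np.random.default_rng(seed)
    yy, xx = np.mgrid[0:size, 0:size]
    if kind == "box":                      # loose bounding box
        x0, y0 = rng.integers(0, size // 2, 2)
        w, h = rng.integers(size // 4, size // 2, 2)
        return ((xx >= x0) & (xx < x0 + w) &
                (yy >= y0) & (yy < y0 + h)).astype(np.uint8)
    if kind == "blob":                     # irregular blob around random seed points
        pts = rng.integers(0, size, (8, 2))
        d = np.min([(xx - p[0]) ** 2 + (yy - p[1]) ** 2 for p in pts], axis=0)
        return (d < (size // 6) ** 2).astype(np.uint8)
    if kind == "scribble":                 # thick random polyline
        mask = np.zeros((size, size), np.uint8)
        p = rng.integers(0, size, 2).astype(float)
        for _ in range(200):
            p = np.clip(p + rng.normal(0, 6, 2), 0, size - 1)
            y, x = int(p[1]), int(p[0])
            mask[max(0, y - 4):y + 4, max(0, x - 4):x + 4] = 1
        return mask
    raise ValueError(kind)

for k in ("box", "blob", "scribble"):
    m = free_shape_mask(k)
    print(k, m.shape, m.mean().round(3))   # coverage fraction of each mask
```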
👥 Authors

Jue Wang
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Yuxiang Lin
Shenzhen Technology University
Tianshuo Yuan
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Zhi-Qi Cheng
Assistant Professor @ UW | Graduate Faculty | Ex-CMU, Google, Microsoft | Intel & IBM PhD Fellowship
multimedia processing · multimedia understanding · multimodal foundation models
Xiaolong Wang
Shenzhen Technology University
GH Jiao
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Wei Chen
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Xiaojiang Peng
Shenzhen Technology University
Computer Vision · Facial Expression Recognition · Multimodal Emotion Recognition