🤖 AI Summary
Existing open-source datasets for instruction-guided image editing suffer from low quality, and current editing modules generalize poorly across diverse tasks. Method: We introduce X2Edit, a high-quality, balanced dataset of 3.7 million samples spanning 14 instruction categories, and propose a task-aware MoE-LoRA fine-tuning framework. Specifically: (1) editing instructions are generated by vision-language models and filtered via multi-dimensional automatic scoring to ensure high-fidelity edit pairs; (2) contrastive learning is applied in the diffusion model's latent space, with positives and negatives defined by editing type, to strengthen editing representations; (3) a lightweight MoE-LoRA adapter is trained on FLUX.1, enabling plug-and-play compatibility with mainstream generative models. Contribution/Results: The method achieves competitive performance on multiple benchmarks while training only 8% of the full model's parameters. Both the X2Edit dataset and source code are publicly released to advance community research.
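To make the task-aware MoE-LoRA idea concrete, here is a minimal numpy sketch of one adapted linear layer: a frozen base weight plus several low-rank expert adapters, mixed by a router conditioned on the editing-task type. All names, shapes, and the softmax router are illustrative assumptions, not the actual X2Edit implementation (which sits on top of FLUX.1's attention/MLP layers).

```python
import numpy as np

class MoELoRALinear:
    """Hypothetical sketch: frozen linear layer + task-routed LoRA experts."""

    def __init__(self, d_in, d_out, rank=4, n_experts=4, n_tasks=14, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in)) * 0.02   # frozen base weight
        # One low-rank pair (B_e @ A_e) per expert; only these would be trained.
        self.A = rng.standard_normal((n_experts, rank, d_in)) * 0.02
        self.B = np.zeros((n_experts, d_out, rank))  # zero-init: adapter starts as a no-op
        # Task-aware router: maps a task-type one-hot vector to expert logits.
        self.router = rng.standard_normal((n_experts, n_tasks)) * 0.02

    def __call__(self, x, task_onehot):
        logits = self.router @ task_onehot
        gates = np.exp(logits - logits.max())
        gates /= gates.sum()                      # softmax over experts
        y = self.W @ x                            # frozen base path
        for e in range(len(gates)):               # gated sum of expert low-rank updates
            y += gates[e] * (self.B[e] @ (self.A[e] @ x))
        return y
```

Because the `B` matrices are zero-initialized, the layer initially reproduces the frozen base output exactly; training only the `A`/`B`/router parameters is what keeps the adapter at a small fraction of the full model's size.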
📝 Abstract
Existing open-source datasets for arbitrary-instruction image editing remain suboptimal, and a plug-and-play editing module compatible with the generative models prevalent in the community is notably absent. In this paper, we first introduce the X2Edit Dataset, a comprehensive dataset covering 14 diverse editing tasks, including subject-driven generation. We use industry-leading unified image generation models and expert models to construct the data. Meanwhile, we design sensible editing instructions with a VLM and apply multiple scoring mechanisms to filter the data. As a result, we construct 3.7 million high-quality samples with balanced category coverage. Second, to integrate seamlessly with community image generation models, we design task-aware MoE-LoRA training based on FLUX.1, requiring only 8% of the full model's parameters. To further improve performance, we exploit the internal representations of the diffusion model and define positive/negative samples by editing type to introduce contrastive learning. Extensive experiments demonstrate that the model's editing performance is competitive with many strong models, and the constructed dataset exhibits substantial advantages over existing open-source datasets. The open-source code, checkpoints, and datasets for X2Edit can be found at the following link: https://github.com/OPPO-Mente-Lab/X2Edit.
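The contrastive objective described above, with positives and negatives defined by editing type, can be sketched as a standard InfoNCE-style loss over latent features. This is a generic illustration under assumed inputs (cosine similarity, a single positive, temperature 0.1), not the paper's exact formulation.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style loss: pull the anchor toward a same-edit-type positive,
    push it away from different-edit-type negatives. Inputs are 1-D latent
    feature vectors (hypothetical; stand-ins for diffusion-model representations)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    sims = [cos(anchor, positive)] + [cos(anchor, n) for n in negatives]
    logits = np.array(sims) / tau
    m = logits.max()  # subtract max for numerical stability
    # -log softmax(logits)[0]: negative log-probability of the positive
    return float(-(logits[0] - m) + np.log(np.exp(logits - m).sum()))
```

The loss shrinks as the anchor's latent moves closer to its positive than to any negative, which is what encourages edits of the same type to cluster in the diffusion model's representation space.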