Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing unified multimodal models rely on mixed multi-task training, which is prone to task interference and data heterogeneity and therefore struggles to improve understanding, generation, and editing together. To overcome this, the authors propose Uni-Edit, which treats intelligent image editing as a general task for unified fine-tuning and improves all three capabilities at once with a single task, a single training stage, and a single dataset. They introduce what they describe as the first automated and scalable pipeline for synthesizing complex editing instructions; it uses VQA data to generate difficult prompts with embedded implicit questions and nested logical structures, producing the Uni-Edit-148k dataset. Experiments on models such as BAGEL and Janus-Pro show that fine-tuning solely on Uni-Edit consistently outperforms conventional multi-task approaches without any auxiliary operations.

📝 Abstract

Currently, enhancing Unified Multimodal Models (UMMs) with image understanding, generation, and editing capabilities relies mainly on mixed multi-task training. Because of inherent task conflicts, such a strategy requires complex multi-stage pipelines, massive data mixing, and balancing tricks, yet yields only a performance trade-off rather than true mutual reinforcement. To break this paradigm, we propose Uni-Edit, an intelligent image editing task that serves as the first general task for UMM tuning. Unlike complex mixed pipelines, Uni-Edit improves performance across all three abilities at once using only one task, one training stage, and one dataset. Specifically, we first identify image editing as an inherently ideal general task, as it naturally demands both visual understanding and generation. However, existing editing data relies on simplistic instructions that severely underutilize a model's understanding capacity. To address this, we introduce the first automated and scalable data synthesis pipeline for intelligent editing, which transforms diverse VQA data into complex and effective editing instructions with embedded questions and nested logic. This yields Uni-Edit-148k, which pairs diverse reasoning-intensive instructions with high-quality edited images. Extensive experiments on BAGEL and Janus-Pro demonstrate that tuning solely on Uni-Edit achieves comprehensive enhancements across all three capabilities without any auxiliary operations.
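
To make "embedded questions and nested logic" concrete, here is a toy Python sketch of how a VQA record might be turned into such an instruction. It is an illustration only, not the authors' pipeline: the abstract does not describe the actual synthesis procedure, and the `VQASample` schema, the `embed_question` and `nest_condition` helpers, and the string templates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VQASample:
    """One visual question answering record (hypothetical schema)."""
    image_id: str
    question: str
    answer: str


def embed_question(sample: VQASample, edit_action: str) -> str:
    """Build an instruction whose edit target is named only by a question.

    The answer is deliberately left out of the text, so a model has to
    resolve the question from the image before it can carry out the edit.
    """
    question = sample.question.strip().rstrip("?")
    question = question[0].lower() + question[1:]
    return f"Work out {question}, then {edit_action}."


def nest_condition(instruction: str, condition: str, fallback: str) -> str:
    """Wrap an instruction in a simple if/otherwise structure (assumed template)."""
    body = instruction[0].lower() + instruction[1:]
    return f"If {condition}, {body} Otherwise, {fallback}."


sample = VQASample("img_0001", "What animal is sitting on the bench?", "cat")
simple = embed_question(sample, "paint that animal blue")
nested = nest_condition(simple, "the scene is outdoors", "leave the image unchanged")
print(simple)
print(nested)
```

The point of the sketch is the contrast with a plain instruction such as "paint the cat blue": the synthesized version never names the cat, so carrying out the edit requires answering the question first.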

Problem

Research questions and friction points this paper is trying to address.

Unified Multimodal Models

image editing

multi-task training

task conflict

model tuning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uni-Edit

unified multimodal model

intelligent image editing