InstructX: Towards Unified Visual Editing with MLLM Guidance

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the scarcity of video editing data and the difficulty of unifying image and video editing within a single modeling framework, this paper proposes InstructX, the first unified instruction-guided image/video editing framework that leverages multimodal large language models (MLLMs) to steer diffusion models. Methodologically, InstructX attains zero-shot, high-fidelity video editing while trained only on image data; introduces a modality-aware feature alignment mechanism that adaptively fuses cross-modal features and aligns instructions between images and videos within a shared diffusion backbone; and jointly optimizes the MLLM's semantic understanding and the diffusion model's generative capacity. Evaluated on multiple image and video editing benchmarks, InstructX achieves state-of-the-art performance while drastically reducing reliance on annotated video data, validating both the effectiveness and the strong generalization of unified multimodal editing.
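
The summary does not spell out the architecture, so the sketch below is only a rough illustration of how modality-specific MLLM features might condition a shared diffusion backbone: MLLM hidden states are routed through per-modality projections into one common conditioning space. All class names, dimensions, and the routing scheme are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ModalityAwareConditioner(nn.Module):
    """Hypothetical adapter: route MLLM hidden states through a
    modality-specific projection before they condition a shared
    diffusion backbone via cross-attention."""

    def __init__(self, mllm_dim: int = 4096, cond_dim: int = 1024):
        super().__init__()
        # Separate learned projections for image vs. video instructions,
        # both mapping into one shared conditioning space.
        self.image_proj = nn.Linear(mllm_dim, cond_dim)
        self.video_proj = nn.Linear(mllm_dim, cond_dim)

    def forward(self, mllm_features: torch.Tensor, modality: str) -> torch.Tensor:
        # mllm_features: (batch, seq_len, mllm_dim) hidden states from the MLLM.
        proj = self.image_proj if modality == "image" else self.video_proj
        return proj(mllm_features)  # (batch, seq_len, cond_dim)

# Usage: produce conditioning tokens for the diffusion model's cross-attention.
conditioner = ModalityAwareConditioner()
fake_mllm_states = torch.randn(2, 77, 4096)
image_cond = conditioner(fake_mllm_states, modality="image")
video_cond = conditioner(fake_mllm_states, modality="video")
print(image_cond.shape, video_cond.shape)  # torch.Size([2, 77, 1024]) twice
```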

📝 Abstract
With recent advances in Multimodal Large Language Models (MLLMs) showing strong visual understanding and reasoning, interest is growing in using them to improve the editing performance of diffusion models. Despite rapid progress, most studies lack an in-depth analysis of MLLM design choices. Moreover, the integration of MLLMs and diffusion models remains an open challenge in some difficult tasks, such as video editing. In this paper, we present InstructX, a unified framework for image and video editing. Specifically, we conduct a comprehensive study on integrating MLLMs and diffusion models for instruction-driven editing across diverse tasks. Building on this study, we analyze the cooperation and distinction between images and videos in unified modeling. (1) We show that training on image data can lead to emergent video editing capabilities without explicit supervision, thereby alleviating the constraints imposed by scarce video training data. (2) By incorporating modality-specific MLLM features, our approach effectively unifies image and video editing tasks within a single model. Extensive experiments demonstrate that our method can handle a broad range of image and video editing tasks and achieves state-of-the-art performance.
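
One plausible reading of finding (1), offered purely as an illustration, is that an editor trained only on single images can be applied to video at inference by folding the frame axis into the batch axis, so each frame passes through the image-trained model under the same instruction conditioning. The helper below sketches that reading; `edit_model`, the tensor shapes, and the frame-folding scheme are all hypothetical and not the paper's actual mechanism.

```python
import torch

def edit_video_with_image_model(edit_model, video_latents, cond):
    """Run an image-trained editor over every frame of a video.

    video_latents: (batch, frames, channels, height, width)
    cond: (batch, seq_len, cond_dim) instruction conditioning tokens
    """
    b, f, c, h, w = video_latents.shape
    frames = video_latents.reshape(b * f, c, h, w)   # fold time into batch
    cond_rep = cond.repeat_interleave(f, dim=0)      # repeat conditioning per frame
    edited = edit_model(frames, cond_rep)            # image-trained editor
    return edited.reshape(b, f, c, h, w)             # restore the video shape

# Usage with a dummy "editor" that just returns its input:
dummy_editor = lambda x, c: x
video = torch.randn(1, 8, 4, 32, 32)
cond = torch.randn(1, 77, 1024)
out = edit_video_with_image_model(dummy_editor, video, cond)
print(out.shape)  # torch.Size([1, 8, 4, 32, 32])
```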
Problem

Research questions and friction points this paper is trying to address.

Integrating MLLMs with diffusion models for visual editing
Analyzing MLLM design choices for instruction-driven editing tasks
Unifying image and video editing capabilities in one framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training on images enables emergent video editing
Unifying image and video editing with MLLM features
Single model handles diverse image and video tasks
Chong Mou
Peking University
Diffusion Model · AI Generated Content · Low-level Computer Vision
Qichao Sun
Intelligent Creation Team, ByteDance
Yanze Wu
ByteDance
computer vision
Pengze Zhang
Intelligent Creation Team, ByteDance
Xinghui Li
Intelligent Creation Team, ByteDance
Fulong Ye
ByteDance
Vision-Language Pretrain · Generative models · Diffusion Models
Songtao Zhao
Intelligent Creation Team, ByteDance
Qian He
ByteDance