🤖 AI Summary
Image and video generation and editing suffer from long-standing modality fragmentation: while image generation has converged toward unified frameworks, video generation and editing remain fragmented due to architectural constraints and data scarcity. Method: The authors propose EditVerse, a unified framework for image and video generation and editing within a single model. It represents text, images, and videos as a single token sequence processed by one self-attention architecture, enabling cross-modal in-context learning and flexible handling of arbitrary spatial resolutions and temporal durations. Trained on 232K curated video editing samples under a joint multimodal training paradigm, EditVerse supports cross-modal knowledge transfer and exhibits emergent editing capabilities. Contribution/Results: EditVerse outperforms leading open-source and commercial models across multiple benchmarks and user studies. To foster community advancement, the authors also release EditVerseBench, the first benchmark for instruction-based video editing, along with the model and training code.
📝 Abstract
Recent advances in foundation models highlight a clear trend toward unification and scaling, showing emergent capabilities across diverse domains. While image generation and editing have rapidly transitioned from task-specific to unified frameworks, video generation and editing remain fragmented due to architectural limitations and data scarcity. In this work, we introduce EditVerse, a unified framework for image and video generation and editing within a single model. By representing all modalities, i.e., text, image, and video, as a unified token sequence, EditVerse leverages self-attention to achieve robust in-context learning, natural cross-modal knowledge transfer, and flexible handling of inputs and outputs with arbitrary resolutions and durations. To address the lack of video editing training data, we design a scalable data pipeline that curates 232K video editing samples and combines them with large-scale image and video datasets for joint training. Furthermore, we present EditVerseBench, the first benchmark for instruction-based video editing covering diverse tasks and resolutions. Extensive experiments and user studies demonstrate that EditVerse achieves state-of-the-art performance, surpassing existing open-source and commercial models, while exhibiting emergent editing and generation abilities across modalities.
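The core architectural idea, representing every modality as tokens in one sequence so a single self-attention pass relates them, can be illustrated with a minimal sketch. All dimensions, token counts, and the single-head attention below are illustrative assumptions, not details from the paper (EditVerse's actual tokenizers, model size, and attention variant are not specified here):

```python
import numpy as np

# Illustrative sketch: pack text, image, and video tokens into one
# unified sequence and run one self-attention pass over all of them.
# Token counts and the embedding size are made up for the example.
rng = np.random.default_rng(0)
d = 64  # embedding dimension (illustrative)

text_tokens = rng.normal(size=(12, d))        # e.g. an editing instruction
image_tokens = rng.normal(size=(256, d))      # e.g. a 16x16 patch grid
video_tokens = rng.normal(size=(4 * 256, d))  # e.g. 4 frames of patches

# The unified sequence: every modality becomes "just tokens", so inputs
# of arbitrary resolution or duration only change the sequence length.
seq = np.concatenate([text_tokens, image_tokens, video_tokens], axis=0)

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over the full sequence."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(x.shape[1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

wq, wk, wv = (rng.normal(size=(d, d)) * 0.02 for _ in range(3))
out = self_attention(seq, wq, wk, wv)
print(out.shape)  # → (1292, 64): one output per token, regardless of modality
```

Because text, image, and video tokens sit in the same sequence, every video token can attend directly to the instruction and to reference-image tokens, which is the mechanism behind the in-context learning and cross-modal transfer the abstract describes.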