Edit3K: Universal Representation Learning for Video Editing Components

📅 2024-03-24
🏛️ arXiv.org
📈 Citations: 2
✨ Influential: 0

🤖 AI Summary
This work addresses the challenge of learning visual representations for editing actions (effects, animations, transitions, filters, stickers, and text) in video editing. We propose the first representation learning paradigm explicitly designed for editing components rather than raw video content. To this end, we introduce Edit3K, a benchmark dataset comprising 618,800 synthetically rendered videos spanning 3,094 atomic editing component classes. We design a material-agnostic attention mechanism to disentangle editing appearance from the underlying content, and integrate contrastive learning, synthesis-driven data construction, and cross-material feature alignment. Our method achieves significant improvements over state-of-the-art approaches on editing component retrieval and classification, and its clustering results align more closely with human perceptual similarity judgments. Moreover, it attains state-of-the-art performance on the AutoTransition task for automated transition recommendation.

๐Ÿ“ Abstract
This paper focuses on understanding the predominant video creation pipeline, i.e., compositional video editing with six main types of editing components: video effects, animation, transition, filter, sticker, and text. In contrast to existing visual representation learning of visual materials (i.e., images/videos), we aim to learn visual representations of the editing actions/components that are applied to raw materials. We start by proposing the first large-scale dataset for editing components of video creation, which covers about 3,094 editing components with 618,800 videos. Each video in our dataset is rendered from various image/video materials with a single editing component, which supports atomic visual understanding of different editing components. It can also benefit several downstream tasks, e.g., editing component recommendation and editing component recognition/retrieval. Existing visual representation methods perform poorly here because it is difficult to disentangle the visual appearance of editing components from the raw materials. To that end, we benchmark popular alternative solutions and propose a novel method that learns to attend to the appearance of editing components regardless of raw materials. Our method achieves favorable results on editing component retrieval/recognition compared to the alternative solutions. A user study further shows that our representations cluster visually similar editing components better than other alternatives. Furthermore, our learned representations applied to the transition recommendation task achieve state-of-the-art results on the AutoTransition dataset. The code and dataset are available at https://github.com/GX77/Edit3K.
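The abstract's core idea — pulling together embeddings of the same editing component rendered on different raw materials, while pushing apart embeddings of different components — can be sketched as a standard InfoNCE contrastive objective. This is a minimal illustration under that assumption, not the paper's actual implementation; the material-agnostic attention architecture itself is not reproduced here.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE loss over a batch of embedding pairs.

    anchors[i] and positives[i] are assumed to be embeddings of the
    SAME editing component rendered on DIFFERENT raw materials; every
    other pairing in the batch is treated as a negative.
    """
    # L2-normalize so dot products are cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    probs = exp / exp.sum(axis=1, keepdims=True)
    # The correct match for anchor i is positive i, i.e. the diagonal.
    return -np.log(np.diag(probs)).mean()
```

Minimizing this loss encourages the encoder to respond to the editing component's appearance rather than the material it was rendered on, since the only signal shared by anchor i and positive i is the component itself.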
Problem

Research questions and friction points this paper is trying to address.

Learn visual representations of editing components
Disentangle editing components from raw materials
Improve editing component retrieval and recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale dataset (Edit3K) for video editing components
Material-agnostic attention to editing component appearance
State-of-the-art transition recommendation on AutoTransition
Xin Gu
University of Chinese Academy of Sciences
Libo Zhang
Institute of Software, Chinese Academy of Sciences
Fan Chen
ByteDance Inc.
Longyin Wen
ByteDance Inc.
Yufei Wang
ByteDance Inc.
Tiejian Luo
University of Chinese Academy of Sciences
Sijie Zhu
Unknown affiliation