ESA: Energy-Based Shot Assembly Optimization for Automatic Video Editing

📅 2025-11-04
🤖 AI Summary
Traditional automated video editing struggles to model creators’ personalized artistic expression, particularly lacking stylistic consistency and narrative coherence during shot assembly. This paper proposes an energy-based shot sequence optimization framework that, for the first time, enables transfer of editing style from reference videos to generative editing pipelines. Our method leverages a large language model to generate narrative scripts, retrieves candidate shots via vision–language alignment, and constructs a style-aware energy function grounded in fine-grained shot annotations—including shot scale, camera motion, and semantic content—obtained through shot segmentation and multi-attribute labeling. The energy function is further regularized by cinematographic grammar rules to ensure globally optimal shot sequencing. The framework jointly models multiple editing attributes and supports customizable artistic expression, significantly improving stylistic consistency and narrative quality of generated videos. It empowers novice users to produce professional-grade video content without prior editing expertise.
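The retrieval step described above, matching the LLM-generated script against the video library via vision-language alignment, can be sketched as a cosine-similarity ranking over precomputed embeddings. This is a minimal illustration: the encoder choice, embedding dimensionality, and `top_k` parameter are assumptions, not details given by the paper.

```python
import numpy as np

def retrieve_candidates(script_emb, shot_embs, top_k=5):
    """Rank library shots by cosine similarity to a script-line embedding.

    script_emb: (d,) embedding of one script sentence (e.g. from a
    vision-language encoder; the specific model is an assumption here).
    shot_embs: (n, d) embeddings of the shots in the video library.
    Returns the indices of the top_k best-matching shots.
    """
    script_emb = script_emb / np.linalg.norm(script_emb)
    shot_embs = shot_embs / np.linalg.norm(shot_embs, axis=1, keepdims=True)
    sims = shot_embs @ script_emb          # cosine similarity per shot
    return np.argsort(-sims)[:top_k]

# Toy library of 4 shots in a 3-d embedding space
shots = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [1.0, 1.0, 0.0]])
query = np.array([0.9, 0.1, 0.0])          # closest to shot 0, then shot 3
print(retrieve_candidates(query, shots, top_k=2))
```

In the full system these candidate subsets are then handed to the energy-based sequencing stage rather than used directly.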

📝 Abstract
Shot assembly is a crucial step in film production and video editing, involving the sequencing and arrangement of shots to construct a narrative, convey information, or evoke emotions. Traditionally, this process has been manually executed by experienced editors. While current intelligent video editing technologies can handle some automated video editing tasks, they often fail to capture the creator's unique artistic expression in shot assembly. To address this challenge, we propose an energy-based optimization method for video shot assembly. Specifically, we first perform visual-semantic matching between the script generated by a large language model and a video library to obtain subsets of candidate shots aligned with the script semantics. Next, we segment and label the shots from reference videos, extracting attributes such as shot size, camera motion, and semantics. We then employ energy-based models to learn from these attributes, scoring candidate shot sequences based on their alignment with reference styles. Finally, we achieve shot assembly optimization by combining multiple syntax rules, producing videos that align with the assembly style of the reference videos. Our method not only automates the arrangement and combination of independent shots according to specific logic, narrative requirements, or artistic styles but also learns the assembly style of reference videos, creating a coherent visual sequence or holistic visual expression. With our system, even users with no prior video editing experience can create visually compelling videos. Project page: https://sobeymil.github.io/esa.com
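As a rough illustration of the energy-based scoring idea in the abstract, the sketch below compares a candidate sequence's shot-scale transition statistics against a reference video's and assigns lower energy to closer matches. The attribute vocabulary, the L1 distance, and the exhaustive search over permutations are simplifying assumptions; the actual method jointly models multiple attributes (shot size, camera motion, semantics) with learned energy models and adds cinematographic grammar rules as regularization.

```python
import itertools
import numpy as np

# Illustrative shot-scale vocabulary; the paper's label set may differ.
SCALES = ["wide", "medium", "close"]

def transition_counts(seq):
    """Normalized counts of shot-scale transitions (i -> j) in a sequence."""
    idx = {s: k for k, s in enumerate(SCALES)}
    c = np.zeros((len(SCALES), len(SCALES)))
    for a, b in zip(seq, seq[1:]):
        c[idx[a], idx[b]] += 1
    return c / max(len(seq) - 1, 1)

def energy(candidate, reference):
    """Lower energy = candidate's transition statistics lie closer to the
    reference video's editing style (L1 distance as a stand-in score)."""
    return float(np.abs(transition_counts(candidate)
                        - transition_counts(reference)).sum())

reference = ["wide", "medium", "close", "medium", "close"]
candidates = itertools.permutations(["wide", "medium", "close", "close"])
best = min(candidates, key=lambda c: energy(list(c), reference))
print(best, round(energy(list(best), reference), 3))
```

For realistic shot counts the brute-force search is replaced by the paper's global optimization over the regularized energy function.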
Problem

Research questions and friction points this paper is trying to address.

Automating video shot assembly while capturing artistic expression
Learning assembly styles from reference videos using energy models
Enabling novice users to create professional-style video edits
Innovation

Methods, ideas, or system contributions that make the work stand out.

Energy-based optimization for video shot assembly
Visual-semantic matching between script and video library
Learning assembly style from reference videos with energy models
Yaosen Chen
Sobey Media Intelligence Laboratory, University of Electronic Science and Technology of China
Wei Wang
Sobey Media Intelligence Laboratory
Tianheng Zheng
Sobey Media Intelligence Laboratory, Sichuan University
Xuming Wen
Sobey Media Intelligence Laboratory
Han Yang
Sobey Media Intelligence Laboratory, Qinghai Normal University
Yanru Zhang
Professor, University of Electronic Science and Technology of China
Game Theory · Smart Grid · Wireless Networking