When and How to Cut Classical Concerts? A Multimodal Automated Video Editing Approach

📅 2025-10-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses automated editing of multi-camera classical concert videos. To tackle this problem, we propose an end-to-end multimodal framework that decouples the two core subtasks: "when to cut" (cut-point detection) and "how to cut" (shot selection). Methodologically, we design a lightweight convolutional-Transformer hybrid architecture to jointly model log-mel spectrograms, CLIP-based visual embeddings, and scalar temporal features. To mitigate cross-performance interference, we introduce intra-concert segment constraints; additionally, we leverage CLIP to generate high-quality pseudo-labels for dataset construction. Experiments demonstrate that our approach significantly outperforms existing baselines in cut-point detection and achieves competitive performance in shot selection. Overall, this work establishes a novel paradigm for multimodal intelligent video editing and offers a practical pathway for real-world classical music video production.
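The cut-point detector described above can be sketched as follows. This is a minimal illustrative PyTorch model, not the authors' implementation: all layer counts, channel widths, and the fusion-by-addition scheme are assumptions; only the three input modalities (log-mel spectrogram, CLIP embedding, scalar temporal features) and the convolutional-Transformer structure come from the summary.

```python
import torch
import torch.nn as nn

class CutPointDetector(nn.Module):
    """Hedged sketch of a lightweight conv-Transformer hybrid that fuses a
    log-mel spectrogram, a CLIP-style image embedding, and scalar temporal
    features into per-frame cut logits. Sizes are illustrative assumptions."""

    def __init__(self, n_mels=64, clip_dim=512, n_scalar=4, d_model=128):
        super().__init__()
        # Convolutional front-end over the log-mel spectrogram (B, 1, mels, T)
        self.audio_conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse the mel axis, keep time
        )
        self.audio_proj = nn.Linear(32, d_model)
        self.clip_proj = nn.Linear(clip_dim, d_model)
        self.scalar_proj = nn.Linear(n_scalar, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)  # per-frame cut-probability logit

    def forward(self, mel, clip_emb, scalars):
        # mel: (B, 1, n_mels, T); clip_emb: (B, clip_dim); scalars: (B, n_scalar)
        a = self.audio_conv(mel).squeeze(2).transpose(1, 2)   # (B, T, 32)
        tokens = self.audio_proj(a)                           # (B, T, d_model)
        # Broadcast the static visual/scalar context over the time axis
        ctx = self.clip_proj(clip_emb) + self.scalar_proj(scalars)
        tokens = tokens + ctx.unsqueeze(1)
        return self.head(self.encoder(tokens)).squeeze(-1)    # (B, T) logits
```

Adding the projected context to every audio token is one simple fusion choice; concatenation or cross-attention would be equally plausible readings of "jointly model".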

📝 Abstract
Automated video editing remains an underexplored task in the computer vision and multimedia domains, especially when contrasted with the growing interest in video generation and scene understanding. In this work, we address the specific challenge of editing multi-camera recordings of classical music concerts by decomposing the problem into two key sub-tasks: when to cut and how to cut. Building on recent literature, we propose a novel multimodal architecture for the temporal segmentation task (when to cut), which integrates log-mel spectrograms from the audio signals, an optional image embedding, and scalar temporal features through a lightweight convolutional-transformer pipeline. For the spatial selection task (how to cut), we improve on prior work by replacing older backbones such as ResNet with a CLIP-based encoder and by constraining distractor selection to segments from the same concert. Our dataset was constructed following a pseudo-labeling approach, in which raw video data was automatically clustered into coherent shot segments. We show that our models outperform previous baselines in detecting cut points and provide competitive visual shot selection, advancing the state of the art in multimodal automated video editing.
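The pseudo-labeling idea in the abstract, clustering raw video into coherent shot segments, can be illustrated with a minimal sketch: cut wherever consecutive frame embeddings (e.g., CLIP features) drift beyond a cosine-distance threshold. The function name and threshold value are assumptions for illustration; the paper does not specify its clustering procedure at this level of detail.

```python
import numpy as np

def pseudo_label_shots(frame_embs, threshold=0.3):
    """Hedged sketch of embedding-based shot pseudo-labeling: returns
    (start, end) index pairs for segments whose consecutive frame
    embeddings stay within a cosine-distance threshold of each other.
    The 0.3 threshold is an illustrative assumption."""
    embs = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    sims = np.sum(embs[:-1] * embs[1:], axis=1)           # cosine similarity
    cut_points = np.where(1.0 - sims > threshold)[0] + 1  # boundary indices
    boundaries = [0, *cut_points.tolist(), len(frame_embs)]
    return [(boundaries[i], boundaries[i + 1]) for i in range(len(boundaries) - 1)]
```

For example, ten frames whose embeddings form two distinct clusters of five would be split into the segments `(0, 5)` and `(5, 10)`.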
Problem

Research questions and friction points this paper is trying to address.

Automated video editing for classical music concerts
Determining optimal timing for camera shot transitions
Selecting best camera views from multicamera recordings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal architecture with audio spectrograms and image embeddings
CLIP-based encoder replacing ResNet for spatial selection
Constraining shot selection to same concert segments
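The third innovation above, restricting distractors to the same concert, amounts to a constrained negative-sampling step during training. A minimal sketch, assuming an illustrative list-of-dicts segment schema (the `concert_id` key and function name are not from the paper):

```python
import random

def sample_distractors(segments, positive_idx, n_distractors=3):
    """Hedged sketch of intra-concert distractor sampling: negatives for a
    positive segment are drawn only from other segments of the SAME concert,
    so the model learns to discriminate views rather than performances.
    `segments` is assumed to be a list of dicts with a 'concert_id' key."""
    concert = segments[positive_idx]["concert_id"]
    pool = [i for i, s in enumerate(segments)
            if s["concert_id"] == concert and i != positive_idx]
    return random.sample(pool, min(n_distractors, len(pool)))
```

Sampling across concerts instead would let the model exploit superficial cues (stage lighting, ensemble size) rather than shot quality, which is the cross-performance interference the constraint is meant to avoid.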
Daniel Gonzálbez-Biosca
eHealth Center, Faculty of Computer Science, Multimedia and Telecommunications, Universitat Oberta de Catalunya, Barcelona, Catalonia, Spain
Josep Cabacas-Maso
eHealth Center, Faculty of Computer Science, Multimedia and Telecommunications, Universitat Oberta de Catalunya, Barcelona, Catalonia, Spain
Carles Ventura
Universitat Oberta de Catalunya (UOC)
Computer vision; Image and video segmentation
Ismael Benito-Altamirano
eHealth Center, Faculty of Computer Science, Multimedia and Telecommunications, Universitat Oberta de Catalunya, Barcelona, Catalonia, Spain; MIND/IN2UB, Department of Electronic and Biomedical Engineering, Universitat de Barcelona, Barcelona, Catalonia, Spain