🤖 AI Summary
This study addresses the automated editing of multi-camera classical concert videos. To tackle this problem, we propose an end-to-end multimodal framework that decouples the two core subtasks: “when to cut” (cut-point detection) and “how to cut” (shot selection). Methodologically, we design a lightweight convolutional-Transformer hybrid architecture that jointly models log-mel spectrograms, CLIP-based visual embeddings, and scalar temporal features. To mitigate cross-performance interference, we introduce intra-concert segment constraints; additionally, we leverage CLIP to generate high-quality pseudo-labels for dataset construction. Experiments demonstrate that our approach significantly outperforms existing baselines in cut-point detection and achieves competitive performance in shot selection. Overall, this work establishes a multimodal paradigm for intelligent video editing and provides a practical technical pathway for real-world classical music video production.
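The summary above describes fusing three per-frame modalities (log-mel audio frames, CLIP visual embeddings, scalar temporal features) in a convolutional-Transformer pipeline. The paper's implementation is not reproduced here; the following is a minimal illustrative sketch under assumptions of our own (dimensions, a single-head self-attention layer standing in for the Transformer, random initialization, and the class name `CutPointSketch` are all hypothetical):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class CutPointSketch:
    """Toy stand-in for the multimodal cut-point model: per-frame features
    from the three modalities are concatenated, projected, passed through
    one self-attention layer over time, and scored with a sigmoid head."""
    def __init__(self, n_mels, d_clip, d_scalar, d_model=32, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda *shape: rng.standard_normal(shape) * 0.05
        self.W_in = init(n_mels + d_clip + d_scalar, d_model)  # fusion projection
        self.W_q, self.W_k, self.W_v = (init(d_model, d_model) for _ in range(3))
        self.w_out = init(d_model)                             # scoring head

    def forward(self, mel, clip_emb, scalars):
        # mel: (T, n_mels), clip_emb: (T, d_clip), scalars: (T, d_scalar)
        x = np.concatenate([mel, clip_emb, scalars], axis=-1) @ self.W_in
        q, k, v = x @ self.W_q, x @ self.W_k, x @ self.W_v
        att = softmax(q @ k.T / np.sqrt(x.shape[-1]))  # (T, T) temporal attention
        h = x + att @ v                                # residual connection
        logits = h @ self.w_out                        # (T,) one logit per frame
        return 1.0 / (1.0 + np.exp(-logits))           # per-frame cut probability
```

A real model would replace the average-style projection with the paper's convolutional audio front end and a full Transformer encoder, and would be trained against the cut-point labels.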
📝 Abstract
Automated video editing remains an underexplored task in computer vision and multimedia, especially when contrasted with the growing interest in video generation and scene understanding. In this work, we address the specific challenge of editing multi-camera recordings of classical music concerts by decomposing the problem into two key sub-tasks: when to cut and how to cut. Building on recent literature, we propose a novel multimodal architecture for the temporal segmentation task (when to cut), which integrates log-mel spectrograms of the audio signal, an optional image embedding, and scalar temporal features through a lightweight convolutional-transformer pipeline. For the spatial selection task (how to cut), we improve on prior work by replacing older backbones (e.g., ResNet) with a CLIP-based encoder and by constraining distractor selection to segments from the same concert. Our dataset was constructed with a pseudo-labeling approach, in which raw video data were automatically clustered into coherent shot segments. We show that our models outperform previous baselines in cut-point detection and provide competitive visual shot selection, advancing the state of the art in multimodal automated video editing.