🤖 AI Summary
This study addresses the automated editing of multi-camera classical concert videos. To tackle this problem, we propose an end-to-end multimodal framework that decouples the two core subtasks: “when to cut” (cut-point detection) and “how to cut” (shot selection). Methodologically, we design a lightweight convolutional-Transformer hybrid architecture that jointly models log-mel spectrograms, CLIP-based visual embeddings, and scalar temporal features. To mitigate cross-performance interference, we introduce intra-concert segment constraints; additionally, we leverage CLIP to generate high-quality pseudo-labels for dataset construction. Experiments demonstrate that our approach significantly outperforms existing baselines in cut-point detection and achieves competitive performance in shot selection. Overall, this work establishes a multimodal paradigm for intelligent video editing and provides a practical technical pathway for real-world classical music video production.
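The summary above describes fusing three per-frame modalities (log-mel audio frames, CLIP visual embeddings, scalar temporal features) in a convolutional-Transformer pipeline. The paper's implementation is not reproduced here; the following is a minimal illustrative sketch under assumptions of our own (dimensions, a single-head self-attention layer standing in for the Transformer, random initialization, and the class name `CutPointSketch` are all hypothetical):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class CutPointSketch:
    """Toy stand-in for the multimodal cut-point model: per-frame features
    from the three modalities are concatenated, projected, passed through
    one self-attention layer over time, and scored with a sigmoid head."""
    def __init__(self, n_mels, d_clip, d_scalar, d_model=32, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda *shape: rng.standard_normal(shape) * 0.05
        self.W_in = init(n_mels + d_clip + d_scalar, d_model)  # fusion projection
        self.W_q, self.W_k, self.W_v = (init(d_model, d_model) for _ in range(3))
        self.w_out = init(d_model)                             # scoring head

    def forward(self, mel, clip_emb, scalars):
        # mel: (T, n_mels), clip_emb: (T, d_clip), scalars: (T, d_scalar)
        x = np.concatenate([mel, clip_emb, scalars], axis=-1) @ self.W_in
        q, k, v = x @ self.W_q, x @ self.W_k, x @ self.W_v
        att = softmax(q @ k.T / np.sqrt(x.shape[-1]))  # (T, T) temporal attention
        h = x + att @ v                                # residual connection
        logits = h @ self.w_out                        # (T,) one logit per frame
        return 1.0 / (1.0 + np.exp(-logits))           # per-frame cut probability
```

A real model would replace the average-style projection with the paper's convolutional audio front end and a full Transformer encoder, and would be trained against the cut-point labels.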
📝 Abstract
Automated video editing remains an underexplored task in computer vision and multimedia, especially when contrasted with the growing interest in video generation and scene understanding. In this work, we address the specific challenge of editing multi-camera recordings of classical music concerts by decomposing the problem into two key sub-tasks: when to cut and how to cut. Building on recent literature, we propose a novel multimodal architecture for the temporal segmentation task (when to cut), which integrates log-mel spectrograms of the audio signal, an optional image embedding, and scalar temporal features through a lightweight convolutional-transformer pipeline. For the spatial selection task (how to cut), we improve on prior work by replacing older backbones (e.g., ResNet) with a CLIP-based encoder and by constraining distractor selection to segments from the same concert. Our dataset was constructed with a pseudo-labeling approach, in which raw video data were automatically clustered into coherent shot segments. We show that our models outperform previous baselines in cut-point detection and provide competitive visual shot selection, advancing the state of the art in multimodal automated video editing.