🤖 AI Summary
To address the inefficiency of manually editing full-length music tracks for short videos, this paper introduces Music Grounding by Short Video (MGSV), a novel task that localizes a semantically and temporally aligned music segment within a long audio track for a given video. The paper formally defines MGSV, constructs the first large-scale benchmark for it, MGSV-EC (53K video-music pairs covering 35K annotated music moments from 4K unique tracks), and proposes MaDe, an end-to-end framework that performs video-to-music matching and music moment detection within a single network, jointly optimizing feature extraction, cross-modal matching, and boundary regression. Experiments on MGSV-EC show that MaDe significantly outperforms staged baselines, validating both the inherent difficulty of MGSV and the effectiveness of the approach. Data and code will be publicly released.
📝 Abstract
Adding proper background music helps make a short video complete and ready to share. Previous research tackles this by video-to-music retrieval (V2MR), which aims to find the most suitable music track from a collection to match the content of a given query video. In practice, however, music tracks are typically much longer than the query video, necessitating (manual) trimming of the retrieved music to a shorter segment that matches the video duration. In order to bridge the gap between the practical need for music moment localization and V2MR, we propose a new task termed Music Grounding by Short Video (MGSV). To tackle the new task, we introduce a new benchmark, MGSV-EC, which comprises a diverse set of 53K short videos associated with 35K different music moments from 4K unique music tracks. Furthermore, we develop a new baseline method, MaDe, which performs both video-to-music matching and music moment detection within a unified end-to-end deep network. Extensive experiments on MGSV-EC not only highlight the challenging nature of MGSV but also establish MaDe as a strong baseline. Data and code will be released.