Music Grounding by Short Video

πŸ“… 2024-08-30
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the inefficiency of manually trimming full-length music tracks for short videos, this paper introduces Music Grounding by Short Video (MGSV), a new task that localizes a semantically and temporally aligned music segment within a long audio track. The paper formally defines MGSV, constructs the first large-scale benchmark, MGSV-EC (53K short videos paired with 35K music moments from 4K unique tracks), and proposes MaDe, an end-to-end framework that performs video-to-music matching and music moment detection jointly, unifying feature extraction, match scoring, and boundary regression in a single network. Experiments show that MaDe substantially outperforms staged baselines, underscoring both the difficulty of MGSV and the effectiveness of the approach on MGSV-EC. Data and code will be released.
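The summary above describes MaDe as coupling cross-modal matching with moment localization. As a rough illustration only (the paper's actual architecture, features, and regression heads are not specified here), the sketch below uses hypothetical pooled video embeddings and per-second music embeddings, scores each music second by cosine similarity, and centers a video-length window on the best-matching second:

```python
import numpy as np

# Illustrative sketch, NOT the MaDe method: all shapes, names, and the
# argmax-based localization heuristic are assumptions for demonstration.

def localize_music_moment(video_emb, music_embs, video_duration_s):
    """Toy stand-in for joint video-to-music matching + moment detection.

    video_emb:  (d,) pooled video feature (hypothetical)
    music_embs: (T, d) per-second music features (hypothetical)
    Returns (start_s, end_s, match_score).
    """
    # Cross-modal matching: cosine similarity of the video against each second
    v = video_emb / np.linalg.norm(video_emb)
    m = music_embs / np.linalg.norm(music_embs, axis=1, keepdims=True)
    sims = m @ v                                  # (T,) per-second scores

    # Moment localization: window of video length centered on the peak second
    center = int(np.argmax(sims))
    start = max(0.0, center - video_duration_s / 2)
    end = min(float(len(sims)), start + video_duration_s)
    return start, end, float(sims[center])
```

A learned system would replace the argmax heuristic with trained boundary regression, but the sketch conveys why joint modeling helps: the same embeddings drive both the matching score and the segment boundaries.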

πŸ“ Abstract
Adding proper background music helps make a short video complete and ready to share. Previous research tackles the task by video-to-music retrieval (V2MR), which aims to find the most suitable music track from a collection to match the content of a given query video. In practice, however, music tracks are typically much longer than the query video, necessitating (manual) trimming of the retrieved music to a shorter segment that matches the video duration. In order to bridge the gap between the practical need for music moment localization and V2MR, we propose a new task termed Music Grounding by Short Video (MGSV). To tackle the new task, we introduce a new benchmark, MGSV-EC, which comprises a diverse set of 53K short videos associated with 35K different music moments from 4K unique music tracks. Furthermore, we develop a new baseline method, MaDe, which performs both video-to-music matching and music moment detection within a unified end-to-end deep network. Extensive experiments on MGSV-EC not only highlight the challenging nature of MGSV but also establish MaDe as a strong baseline. Data and code will be released.
Problem

Research questions and friction points this paper is trying to address.

Automatically match music to short videos
Localize specific music moments for video duration
Develop a unified model for video-music alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes Music Grounding by Short Video (MGSV) task
Introduces MGSV-EC benchmark with 53K videos
Develops MaDe method for end-to-end music matching
πŸ”Ž Similar Papers
No similar papers found.
Zijie Xin
Renmin University of China
Video understanding · Multi-modal learning · Cross-modal retrieval · Computer Vision
Minquan Wang
Kuaishou Technology
Jingyu Liu
Quan Chen
Kuaishou Technology
Ye Ma
Kuaishou Technology
Peng Jiang
Kuaishou Technology
Xirong Li
MoE Key Lab of DEKE, Renmin University of China