🤖 AI Summary
To address the inefficiency of manually editing full-length music tracks for short videos, this paper introduces Music Grounding by Short Video (MGSV), a novel task that localizes a semantically and temporally aligned music segment within a long audio track for a given video. The paper formally defines MGSV, constructs the first large-scale benchmark for it, MGSV-EC (53K video-music pairs covering 35K annotated music moments from 4K unique tracks), and proposes MaDe, an end-to-end framework that performs video-to-music matching and music moment detection within a single network, jointly optimizing feature extraction, cross-modal matching, and boundary regression. Experiments on MGSV-EC show that MaDe significantly outperforms staged baselines, validating both the inherent difficulty of MGSV and the effectiveness of the approach. Data and code will be publicly released.
📝 Abstract
Adding proper background music helps make a short video complete and ready to share. Previous research tackles this by video-to-music retrieval (V2MR), which aims to find the most suitable music track from a collection to match the content of a given query video. In practice, however, music tracks are typically much longer than the query video, necessitating (manual) trimming of the retrieved music to a shorter segment that matches the video duration. In order to bridge the gap between the practical need for music moment localization and V2MR, we propose a new task termed Music Grounding by Short Video (MGSV). To tackle the new task, we introduce a new benchmark, MGSV-EC, which comprises a diverse set of 53K short videos associated with 35K different music moments from 4K unique music tracks. Furthermore, we develop a new baseline method, MaDe, which performs both video-to-music matching and music moment detection within a unified end-to-end deep network. Extensive experiments on MGSV-EC not only highlight the challenging nature of MGSV but also establish MaDe as a strong baseline. Data and code will be released.