JamendoMaxCaps: A Large Scale Music-caption Dataset with Imputed Metadata

📅 2025-02-11

🤖 AI Summary

This work addresses a longstanding challenge in music-language understanding: the scarcity of high-quality annotated music data. It introduces a large-scale, freely licensed music-text dataset comprising over 200,000 instrumental tracks from the Jamendo platform. Captions are generated by a state-of-the-art captioning model and then enriched with imputed metadata: a retrieval system that combines musical features with existing metadata identifies similar tracks, and a local large language model (LLM) uses those tracks to fill in missing metadata fields. The approach is validated quantitatively with five different measurements. The dataset is publicly released to support research in music retrieval, multimodal representation learning, and generative music modeling.

📝 Abstract

We introduce JamendoMaxCaps, a large-scale music-caption dataset featuring over 200,000 freely licensed instrumental tracks from the renowned Jamendo platform. The dataset includes captions generated by a state-of-the-art captioning model, enhanced with imputed metadata. We also introduce a retrieval system that leverages both musical features and metadata to identify similar songs, which are then used to fill in missing metadata using a local large language model (LLLM). This approach allows us to provide a more comprehensive and informative dataset for researchers working on music-language understanding tasks. We validate this approach quantitatively with five different measurements. By making the JamendoMaxCaps dataset publicly available, we provide a high-quality resource to advance research in music-language understanding tasks such as music retrieval, multimodal representation learning, and generative music models.
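The retrieve-then-impute idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the track fields (`embedding`, `tags`, `genre`), the weighting parameter `alpha`, and the helper names are all hypothetical, and a simple majority vote stands in for the local LLM so that the example runs without any model.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a, b):
    """Overlap between two tag collections (0.0 when both are empty)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def retrieve_similar(query, catalog, k=2, alpha=0.7):
    """Rank tracks by a weighted mix of audio and metadata similarity.

    alpha is an assumed weighting, not a value taken from the paper.
    """
    scored = []
    for track in catalog:
        if track["id"] == query["id"]:
            continue
        score = (alpha * cosine(query["embedding"], track["embedding"])
                 + (1 - alpha) * jaccard(query["tags"], track["tags"]))
        scored.append((score, track))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [track for _, track in scored[:k]]


def build_imputation_prompt(query, neighbors, field):
    """Assemble a hypothetical prompt that a local LLM could complete."""
    lines = [f"Fill in the missing '{field}' for the target track."]
    for n in neighbors:
        lines.append(f"- similar track: tags={sorted(n['tags'])}, {field}={n.get(field)}")
    lines.append(f"Target track: tags={sorted(query['tags'])}, {field}=?")
    return "\n".join(lines)


def impute_by_vote(neighbors, field):
    """Stand-in for the LLM step: most common value among the neighbors."""
    values = [n[field] for n in neighbors if n.get(field)]
    return Counter(values).most_common(1)[0][0] if values else None


# Toy catalog with made-up values, for illustration only.
catalog = [
    {"id": "a", "embedding": [1.0, 0.0], "tags": ["piano", "calm"], "genre": "classical"},
    {"id": "b", "embedding": [0.9, 0.1], "tags": ["piano"], "genre": "classical"},
    {"id": "c", "embedding": [0.0, 1.0], "tags": ["guitar", "loud"], "genre": "rock"},
]
query = {"id": "q", "embedding": [0.95, 0.05], "tags": ["piano", "calm"], "genre": None}

neighbors = retrieve_similar(query, catalog, k=2)
prompt = build_imputation_prompt(query, neighbors, "genre")
imputed_genre = impute_by_vote(neighbors, "genre")
```

In a real pipeline the prompt would be sent to a locally hosted model and its answer parsed. The vote here only shows how retrieved neighbors supply the context for filling a missing field.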

Problem

Research questions and friction points this paper is trying to address.

Large-scale music-caption dataset creation

Metadata imputation using local LLM

Advancing music-language understanding research

Innovation

Methods, ideas, or system contributions that make the work stand out.

Imputed metadata enhancement

Local large language model

Music-caption retrieval system
