JamendoMaxCaps: A Large Scale Music-caption Dataset with Imputed Metadata

📅 2025-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a longstanding obstacle in music-language understanding: the scarcity of high-quality annotated music data. It introduces JamendoMaxCaps, a large-scale music-caption dataset of over 200,000 freely licensed instrumental tracks from the Jamendo platform. Captions are produced by a state-of-the-art captioning model. A retrieval system that combines musical features with existing metadata then identifies similar tracks, and a local large language model (LLLM) uses those tracks to impute missing metadata. The approach is validated quantitatively with five different measurements. The dataset is publicly released to support research in music retrieval, multimodal representation learning, and generative music modeling.

📝 Abstract
We introduce JamendoMaxCaps, a large-scale music-caption dataset featuring over 200,000 freely licensed instrumental tracks from the renowned Jamendo platform. The dataset includes captions generated by a state-of-the-art captioning model, enhanced with imputed metadata. We also introduce a retrieval system that leverages both musical features and metadata to identify similar songs, which are then used to fill in missing metadata using a local large language model (LLLM). This approach allows us to provide a more comprehensive and informative dataset for researchers working on music-language understanding tasks. We validate this approach quantitatively with five different measurements. By making the JamendoMaxCaps dataset publicly available, we provide a high-quality resource to advance research in music-language understanding tasks such as music retrieval, multimodal representation learning, and generative music models.
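The retrieve-then-impute idea in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the record fields (`embedding`, `tags`, `genre`), the cosine and Jaccard similarity measures, the `audio_weight` blend, and the prompt wording are all hypothetical. The sketch only shows how similar tracks might be ranked from both audio features and metadata, and how their known values could be assembled into a prompt for a local LLM.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def jaccard(a, b):
    """Overlap between two tag collections (hypothetical metadata similarity)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def retrieve_similar(query, catalog, k=2, audio_weight=0.7):
    """Rank catalog tracks by a weighted blend of audio and metadata similarity.

    The 0.7 / 0.3 weighting is an arbitrary assumption for this sketch.
    """
    scored = []
    for track in catalog:
        if track["id"] == query["id"]:
            continue  # never retrieve the query itself
        score = (audio_weight * cosine(query["embedding"], track["embedding"])
                 + (1 - audio_weight) * jaccard(query["tags"], track["tags"]))
        scored.append((score, track))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [track for _, track in scored[:k]]

def build_imputation_prompt(query, neighbours, field):
    """Assemble a few-shot prompt; the text would be sent to a local LLM."""
    lines = [f"Similar tracks and their '{field}' values:"]
    for t in neighbours:
        lines.append(f"- tags={sorted(t['tags'])}, {field}={t[field]}")
    lines.append(
        f"Target track tags={sorted(query['tags'])}. Predict its '{field}'."
    )
    return "\n".join(lines)

# Toy catalog with made-up two-dimensional "audio embeddings".
catalog = [
    {"id": 1, "embedding": [1.0, 0.0], "tags": ["piano", "calm"], "genre": "classical"},
    {"id": 2, "embedding": [0.9, 0.1], "tags": ["piano"], "genre": "classical"},
    {"id": 3, "embedding": [0.0, 1.0], "tags": ["drums", "energetic"], "genre": "rock"},
]
query = {"id": 99, "embedding": [0.95, 0.05], "tags": ["piano", "calm"], "genre": None}

neighbours = retrieve_similar(query, catalog)
prompt = build_imputation_prompt(query, neighbours, "genre")
print(prompt)
```

In a real pipeline the embeddings would come from an audio model and the prompt would be passed to a locally hosted LLM; both steps are omitted here because the abstract does not specify them.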
Problem

Research questions and friction points this paper is trying to address.

Large-scale music-caption dataset creation
Metadata imputation using local LLM
Advancing music-language understanding research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Imputed metadata enhancement
Local large language model
Music-caption retrieval system